CN113159051A - Remote sensing image lightweight semantic segmentation method based on edge decoupling - Google Patents

Remote sensing image lightweight semantic segmentation method based on edge decoupling

Info

Publication number
CN113159051A
CN113159051A
Authority
CN
China
Prior art keywords
edge
semantic segmentation
feature
remote sensing
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110456921.9A
Other languages
Chinese (zh)
Other versions
CN113159051B (en)
Inventor
段锦 (Duan Jin)
刘高天 (Liu Gaotian)
祝勇 (Zhu Yong)
赵言 (Zhao Yan)
王乐泉 (Wang Lequan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Science and Technology filed Critical Changchun University of Science and Technology
Priority to CN202110456921.9A priority Critical patent/CN113159051B/en
Publication of CN113159051A publication Critical patent/CN113159051A/en
Application granted granted Critical
Publication of CN113159051B publication Critical patent/CN113159051B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/13 - Edge detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight semantic segmentation method for remote sensing images based on edge decoupling. It belongs to the field of computer vision and can be used for intelligent interpretation of remote sensing imagery. On one hand, the Ghost bottleneck module and depthwise separable convolutions reduce the number of model parameters and the computational overhead of the network, effectively improving the efficiency of remote sensing image semantic segmentation and making the proposed segmentation network lightweight; on the other hand, the multi-scale feature pyramid, the global context module and the edge decoupling module improve segmentation accuracy, so that the proposed lightweight network can segment remote sensing images accurately and efficiently while further refining the edge details of the remote sensing image.

Description

Remote sensing image lightweight semantic segmentation method based on edge decoupling
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a light-weight semantic segmentation method for a remote sensing image based on edge decoupling, which can be used for intelligent interpretation in the field of remote sensing images.
Background
The high-resolution remote sensing image contains information such as detailed color and texture characteristics of targets such as roads and buildings, and the intelligent interpretation of the information has important significance in various fields such as military, agriculture and environmental science. To accomplish the task of analytically classifying the remote sensing image, each pixel in the image should be assigned a label associated with the class to which it belongs, which is consistent with the purpose of semantic segmentation of the image.
Inspired by deep learning, and in particular by the introduction of fully convolutional networks, deep-learning-based image semantic segmentation has become mainstream. Methods such as UNet, SegNet, PSPNet and the Deeplab series have appeared and offer clear advantages over traditional remote sensing image segmentation algorithms. However, when these algorithms are applied to semantic segmentation of high-resolution remote sensing images, although they achieve relatively good segmentation results, the large image size and complex network structures lead to slow training and low segmentation efficiency. In addition, remote sensing images contain highly diverse targets with unbalanced class distributions, and the edges of different target classes easily overlap, so the images cannot be finely segmented.
Disclosure of Invention
The invention aims to provide a lightweight semantic segmentation method for remote sensing images based on edge decoupling, in order to solve two technical problems of existing semantic segmentation methods on high-resolution remote sensing imagery: slow inference and low segmentation efficiency caused by large parameter counts and heavy computation, and unsatisfactory edge segmentation caused by the easily overlapping edges of targets of different classes.
In order to achieve the purpose, the invention provides a remote sensing image light-weight semantic segmentation method based on edge decoupling, which has the following specific technical scheme:
a remote sensing image lightweight semantic segmentation method based on edge decoupling comprises the steps of building, training and testing a semantic segmentation network, wherein the semantic segmentation network is a lightweight coding and decoding network with a double-branch structure, after training of the semantic segmentation network is completed based on training samples, a remote sensing image to be tested is input into the semantic segmentation network, and a final remote sensing image semantic segmentation result is output;
the method comprises the following steps in sequence:
step 1, acquiring a remote sensing image data set, and preparing a training and testing sample;
step 2, building a lightweight coding and decoding semantic segmentation network with a double-branch structure;
step 3, inputting the training samples into an encoder and performing feature encoding through feature extraction to obtain an encoding feature map F_E;
step 4, inputting the encoding feature map F_E into a decoder and performing edge feature refinement and an up-sampling operation to obtain a decoding feature map F_D;
Step 5, inputting the decoding characteristic graph into a classifier to perform pixel level classification prediction, outputting a segmentation result, and performing supervised training on the semantic segmentation network through a supervision mechanism;
step 6, training the semantic segmentation network built in step 2 with the training samples according to steps 3 to 5;
step 7, inputting the sample to be tested into the trained semantic segmentation network, outputting the final remote sensing image semantic segmentation result, and completing the test of the semantic segmentation network.
Further, the step 2 builds a lightweight coding and decoding semantic segmentation network with a double-branch structure, and the network comprises an encoder, a decoder and a classifier;
the encoder is of a double-branch structure and comprises a global downsampling block, a lightweight double-branch sub-network and a global feature fusion module;
the decoder consists of a lightweight edge decoupling module and an up-sampling module;
the classifier is composed of a conventional convolutional layer and a SoftMax layer.
Further, obtaining the encoding feature map F_E in step 3 comprises the following steps:
step 3.1, inputting the training sample into a global downsampling block of an encoder to obtain a low-level feature map;
step 3.2, inputting the low-level feature map into a lightweight double-branch subnetwork in the encoder to obtain a space detail feature map and an abstract semantic feature map;
step 3.3, the obtained spatial detail feature map and abstract semantic feature map undergo multi-level feature fusion through the global feature fusion block of the encoder, and the encoding feature map F_E is output.
Further, the global downsampling block in step 3.1 consists of 3 parts: 1 conventional convolution, 1 Ghost bottleneck module, and 1 global context module;
after the input samples pass through the global downsampling block, a low-level feature map whose resolution is 1/4 of the original input is generated and used as the input of the subsequent process.
Further, the lightweight dual-branch sub-network in step 3.2 includes two branches, namely a trunk depth branch for acquiring abstract semantic features and a spatial preservation branch for acquiring spatial detail features; the two branches share the low-level feature map output by the global downsampling block;
the trunk depth branch is constructed on the GhostNet feature extraction network and comprises two structures: one is the branch main body, consisting of 16 Ghost bottleneck modules, which performs 4 downsampling steps to extract deep features; the other is a lightweight feature pyramid comprising four parts (depthwise separable convolution, an up-sampling block, a lightweight atrous spatial pyramid pooling (ASPP) module, and element fusion), which takes the 4 deep feature maps of different scales formed by the main body as input and finally outputs abstract semantic features with an enlarged receptive field and multi-scale information;
the spatial preservation branch consists of 3 depthwise separable convolutions; it downsamples the input low-level features once, so the resolution of the output spatial detail feature map is 1/2 of the input.
Further, the global feature fusion module in step 3.3 includes 3 parts: first, two parallel depthwise separable convolutions with 1 × 1 kernels; second, element fusion; third, a global context module;
the input abstract semantic features and spatial detail features are dimension-adjusted by the two parallel convolutions, a feature map rich in spatial detail and abstract semantic information is output through element fusion, and lightweight context modeling is finally performed by the global context module, so that the resulting encoding feature map F_E better fuses global information.
Further, the decoding feature map F_D in step 4 is obtained as follows: the encoding feature map F_E is first input into the lightweight edge decoupling module of the decoder, where edge feature refinement produces a fine feature map with refined edges; the fine feature map is then input into the up-sampling module of the decoder and up-sampled back to the size of the original input remote sensing image; the restored fine feature map serves as the decoding feature map F_D output by the decoder.
Further, the lightweight edge decoupling module consists of 3 parts, namely a lightweight atrous spatial pyramid pooling (ASPP) block, a main body feature generator and an edge retainer. The encoding features first pass through the lightweight ASPP to generate a feature map F_aspp with multi-scale information and a larger receptive field; the body generator then produces a more consistent feature representation for pixels within the same object, forming the body feature map F_body of the target object; F_body, F_aspp and F_E are input into the edge retainer, which outputs a refined edge feature map F_edge through an explicit subtraction operation, channel-stacking fusion and 1 × 1 conventional convolution dimensionality reduction; finally, the body feature map and the refined edge feature map are fused, and the refined output feature map used for up-sampling recovery is output and denoted F_final. The whole process can be expressed as:

F_aspp = f_dsaspp(F_E),
F_body = φ(F_aspp),
F_edge = ψ(F_body, F_aspp, F_E),
F_final = F_body + F_edge,

where f_dsaspp denotes the lightweight ASPP function, φ the body feature generating function, and ψ the edge-preserving function;
the up-sampling module comprises two steps, a 1 × 1 conventional convolution operation and an up-sampling operation; after the fine feature map F_final passes through this module, the restored feature map has the size of the original input remote sensing image, i.e., the decoder output feature map F_D.
Further, regarding the supervision mechanism in step 5: after the decoding feature map F_D is classified at the pixel level by the classifier, the output is the semantic segmentation result; the network is supervised and trained through a supervision mechanism formed by the segmentation result and the real labels, so that the semantic segmentation network achieves its best segmentation performance.
Further, the supervision mechanism in step 5 is an edge-based supervision method implemented by a designed loss function; the total loss function, denoted L, is:

L = λ1·L_body + λ2·L_final + λ3·L_G + λ4·L_edge, with L_edge = λ5·L_bce + λ6·L_ce,

where L_body, L_final, L_edge and L_G respectively denote the body feature loss, fine feature loss, edge feature loss and global coding loss; the inputs of these 4 loss functions are the segmentation results formed, after up-sampling recovery and a SoftMax layer, from the body feature map, the refined output feature map, the refined edge feature map and the encoding feature map, respectively, together with the corresponding real labels;
the loss function L_edge is a comprehensive loss that acquires a boundary edge prior from the edge prediction part and comprises two terms: the binary cross entropy loss L_bce for boundary pixel classification, and the cross entropy loss L_ce for the edge part of the scene; the hyperparameters λ1, λ2, λ3, λ4, λ5, λ6 control the weighting among the losses.
The method of the invention has the following advantages: it fully considers the total parameter count and total computation of the semantic segmentation network, the influence of large numbers of redundant features on segmentation efficiency and accuracy, and the effect of the relationship between a target's body and its edge on refining the segmentation result.
First, combining the idea of feature sharing, the invention designs a global downsampling block based on the global context module and the Ghost bottleneck module and uses it as the first part of the encoder in the semantic segmentation network, which effectively reduces the parameter scale of early low-level feature extraction, lowers the computational cost, and better fuses global context information into the low-level features.
Second, the invention combines a dual-branch structure with a global feature fusion mode based on the global context module. A lightweight dual-branch sub-network is first built from Ghost bottleneck modules and depthwise separable convolutions, which markedly reduces the parameter scale and computational complexity of the feature extraction stage while making the output encoding features contain rich spatial detail and abstract semantic information. The outputs of the two branches are then fused through the global-context-based fusion mode, so the final encoding features deepen the understanding of global information and reduce the network's loss of weak feature information.
Third, a lightweight edge decoupling module is built with depthwise separable convolutions, and the relationship between an object's body and its edge is introduced by explicitly modeling the body and edge of the target object. This effectively alleviates the coarse edge segmentation of existing remote sensing image semantic segmentation algorithms and improves the segmentation of edge details in remote sensing images.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a semantic segmentation network structure constructed by the method of the present invention.
Fig. 3 is a schematic structural diagram of a Ghost bottleneck module.
FIG. 4 is a diagram illustrating a global context module structure.
Fig. 5 is a schematic view of a lightweight feature pyramid structure in a trunk feature extraction branch.
Fig. 6 is a schematic structural view of a lightweight edge decoupling module.
Fig. 7 is an exemplary diagram of a remote sensing image in a data set and corresponding semantic tags.
FIG. 8 is a graph comparing semantic segmentation results according to an embodiment of the method of the present invention. (wherein (a) and (b) are input samples and corresponding labels, and (c) to (g) are semantic segmentation result graphs of Fast-SCNN, Sem-FPN, the method of the invention, UNet and PSPNet in sequence).
Detailed Description
In order to better understand the purpose, structure and function of the invention, the following describes a remote sensing image lightweight semantic segmentation method based on edge decoupling in further detail with reference to the accompanying drawings.
As shown in FIG. 1, the invention designs a remote sensing image lightweight semantic segmentation method based on edge decoupling, which is applied to a high-resolution remote sensing image, so that the edge segmentation effect is refined while the precision is ensured, and the segmentation efficiency is greatly improved.
As shown in fig. 1, the invention relates to a remote sensing image light-weight semantic segmentation method based on edge decoupling, which specifically comprises the following steps:
step 1, acquiring a remote sensing image data set, and preparing a training and testing sample;
firstly, a high-resolution remote sensing image dataset containing semantic labels is obtained, and labels and data are cropped correspondingly: sliding-window cropping with a fixed 512 × 512 window and a sliding step of 384 (a coverage rate of 0.75 of the window). The newly cropped data and labels remain in correspondence; data augmentation is performed through rotation, color enhancement and similar means, and sufficient samples effectively weaken the influence of overfitting. Finally, training and test samples are divided in a 4:1 ratio;
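As a concrete illustration of this preprocessing, a minimal sliding-window cropping sketch follows (the function name and array layout are assumptions; the patent publishes no code):

```python
import numpy as np

def sliding_window_crop(image: np.ndarray, label: np.ndarray,
                        window: int = 512, stride: int = 384):
    """Cut an image/label pair into 512x512 patches with a stride of 384,
    i.e. a coverage rate of 0.75 of the window, as described above.
    Assumes the image is at least `window` pixels in each dimension."""
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            patches.append((image[top:top + window, left:left + window],
                            label[top:top + window, left:left + window]))
    return patches
```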
step 2, building a lightweight coding and decoding semantic segmentation network with a double-branch structure;
the structure of the constructed semantic segmentation network is shown in FIG. 2. The network is a lightweight coding and decoding semantic segmentation network with a double-branch structure and comprises an encoder, a decoder and a classifier. The encoder is of a double-branch structure and comprises a global downsampling block, a lightweight double-branch sub-network and a global feature fusion module; the decoder is composed of a lightweight edge decoupling module and an up-sampling module. The classifier is composed of a conventional convolution layer and a SoftMax layer;
step 3, inputting the training samples into the encoder and performing feature encoding through feature extraction to obtain the encoding feature map F_E; this involves the following three substeps:
step 3.1, inputting the training sample into a global downsampling block of an encoder to obtain a low-level feature map;
the obtained training samples are input into the network's encoder. The input samples have a scale of 512 × 512 and first pass through the global downsampling block in the encoder, which consists of 3 parts: a conventional convolution, a Ghost bottleneck module, and a global context module. The low-level feature map output by the global downsampling block fuses global context information well and also contains rich spatial detail information;
the conventional convolution is a convolution block with a 3 × 3 kernel and a stride of 2, equipped with a batch normalization layer and a ReLU activation layer; after this first downsampling, the training sample yields a feature map with a resolution of 256 × 256;
the Ghost bottleneck module is a lightweight module originating from the GhostNet network. It is composed of Ghost modules and can generate feature maps of greater channel depth with fewer parameters; its structure is shown in FIG. 3. The module's structure depends on the stride: with a stride of 1 it contains two Ghost modules of stride 1, and with a stride of 2 a channel-by-channel (depthwise) convolution of stride 2 is inserted between the two Ghost modules. In the global downsampling block this module uses a stride of 2, implementing the second downsampling and further reducing the feature map resolution to 128 × 128;
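For reference, a minimal Ghost module in the spirit of GhostNet is sketched below (the ratio and kernel sizes are the GhostNet paper's defaults, assumed rather than taken from the patent):

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """A primary pointwise convolution produces a few intrinsic feature maps;
    a cheap depthwise convolution generates the remaining 'ghost' maps."""
    def __init__(self, in_ch, out_ch, ratio=2, dw_size=3, stride=1):
        super().__init__()
        init_ch = out_ch // ratio        # intrinsic channels
        cheap_ch = out_ch - init_ch      # ghost channels (== init_ch for ratio=2)
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, stride, 0, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, cheap_ch, dw_size, 1, dw_size // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)
```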
the global context module, whose structure is shown in FIG. 4, comprises 3 processes. The first is a global attention pooling mechanism for context modeling: a 1 × 1 conventional convolution and a SoftMax layer obtain self-attention weights for the input feature map, and an attention pooling operation then produces a global context feature map. The second is a feature transform that captures channel dependencies, consisting of two 1 × 1 convolution layers connected by a batch normalization layer and a ReLU activation function. The third is element fusion: the original input feature map is fused with the channel-dependent feature map, aggregating the global context features onto the features of every position. The output feature map keeps the same size as the input, so the final low-level feature map has a scale of 128 × 128;
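A sketch of such a global context block, following the three processes just described, could read as follows (the reduction ratio is an assumption):

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """GCNet-style block: 1x1 conv + SoftMax attention pooling, a two-layer
    1x1 transform with BN/ReLU in between, then element-wise fusion."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)  # attention logits
        mid = channels // reduction
        self.transform = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True), nn.Conv2d(mid, channels, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        weights = self.attn(x).view(b, 1, h * w).softmax(dim=-1)  # (B,1,HW)
        context = torch.bmm(x.view(b, c, h * w),
                            weights.transpose(1, 2)).view(b, c, 1, 1)
        return x + self.transform(context)  # broadcast fusion over positions
```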
step 3.2, inputting the low-level feature map into a lightweight double-branch subnetwork in the encoder to obtain a space detail feature map and an abstract semantic feature map;
the dual-branch sub-network comprises two branches, namely a trunk depth branch for acquiring abstract semantic features and a spatial preservation branch for acquiring spatial detail features. The two branches share the low-level features output by the global downsampling block; compared with a traditional dual-branch network, this removes one input path and reduces the parameter scale and computational cost of early low-level feature extraction;
the trunk depth branch is constructed on the GhostNet network, whose main body comprises 16 Ghost bottleneck modules performing 4 downsampling steps to extract deep features. The method retains the 16 Ghost bottleneck modules of GhostNet and converts them into a fully convolutional network serving as the main body of the trunk depth branch. Processed by this branch, the input low-level feature map finally yields 4 deep feature maps with scales of 64 × 64, 32 × 32, 16 × 16 and 8 × 8. The 4 scales correspond to 4 stages; the numbers of Ghost bottleneck modules per stage are [3, 2, 6, 5], with corresponding convolution kernel sizes [3, 5, 3, 5]. Since downsampling is performed 4 times, each stage contains one Ghost bottleneck module with a stride of 2.
Meanwhile, to obtain rich abstract semantic features, the method combines a depthwise separable convolution module, an up-sampling block and a lightweight atrous spatial pyramid pooling (ASPP) module to build a lightweight feature pyramid over the 4 feature maps; its structure is shown in FIG. 5. The 4 newly generated levels are closely related and enlarge the receptive field; the feature maps carrying multi-scale information are up-sampled to the 64 × 64 scale, and the final abstract semantic feature group formed by element fusion is the output of the trunk depth branch;
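A compact sketch of this lightweight pyramid is given below; the depthwise separable projection and bilinear up-sampling mirror the description, while channel counts and the ASPP internals are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

def ds_conv(in_ch, out_ch, stride=1):
    """3x3 depthwise + 1x1 pointwise convolution with BN and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class LightweightFeaturePyramid(nn.Module):
    """Project the 4 backbone stages (64/32/16/8 px) with depthwise separable
    convolutions, pass the deepest stage through a lightweight ASPP, up-sample
    everything to the largest scale and fuse by element-wise addition."""
    def __init__(self, in_chs, out_ch, aspp: nn.Module):
        super().__init__()
        self.aspp = aspp  # assumed channel-preserving lightweight ASPP
        self.projs = nn.ModuleList(ds_conv(c, out_ch) for c in in_chs)

    def forward(self, feats):            # feats ordered shallow -> deep
        feats = list(feats)
        feats[-1] = self.aspp(feats[-1])
        target = feats[0].shape[-2:]     # 64 x 64 in this embodiment
        out = 0
        for f, proj in zip(feats, self.projs):
            out = out + F.interpolate(proj(f), size=target,
                                      mode='bilinear', align_corners=False)
        return out
```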
the spatial preservation branch is formed by 3 depthwise separable convolutions, all with 3 × 3 kernels and strides of [1, 2, 1]. It downsamples the input low-level feature map once, so the output spatial detail feature map has a resolution of 64 × 64. The branch preserves the spatial scale of the input image with few parameters and low computational cost, and can encode rich spatial information;
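The branch is then a three-layer stack; a sketch with assumed channel widths (the patent does not state them), reusing ds_conv from the pyramid sketch above:

```python
import torch.nn as nn

# Three 3x3 depthwise separable convolutions with strides [1, 2, 1].
# Channel widths (64 -> 128) are assumptions; the single stride-2 layer
# takes the 128x128 low-level map down to 64x64.
spatial_preservation_branch = nn.Sequential(
    ds_conv(64, 64, stride=1),
    ds_conv(64, 128, stride=2),
    ds_conv(128, 128, stride=1))
```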
step 3.3, the obtained spatial detail feature map and abstract semantic feature map undergo multi-level feature fusion through the global feature fusion block of the encoder, and the encoding feature map F_E is output;
The global feature fusion module comprises 3 parts: two parallel depthwise separable convolutions with 1 × 1 kernels, element fusion, and a global context module. The abstract semantic features and spatial detail features from the dual-branch sub-network are dimension-adjusted by the two parallel 1 × 1 convolutions, and element fusion outputs a feature map rich in spatial detail and abstract semantic information. Lightweight context modeling is finally performed by the global context module, so the resulting encoding feature map F_E better fuses global information. Since no downsampling is involved, the output encoding feature map matches the input size;
through the above steps, the encoder finally generates an encoding feature map with a scale of 64 × 64, which serves as the input of the subsequent process;
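Putting step 3.3 into code, a fusion-module sketch might be (channel sizes illustrative; GlobalContextBlock as sketched earlier):

```python
import torch.nn as nn

class GlobalFeatureFusion(nn.Module):
    """Two parallel 1x1 projections align the semantic and detail maps,
    element-wise addition fuses them, and a global context module performs
    lightweight context modeling to produce the encoding feature map F_E."""
    def __init__(self, sem_ch, det_ch, out_ch, gc_block: nn.Module):
        super().__init__()
        self.proj_sem = nn.Conv2d(sem_ch, out_ch, kernel_size=1)
        self.proj_det = nn.Conv2d(det_ch, out_ch, kernel_size=1)
        self.gc = gc_block  # e.g. GlobalContextBlock(out_ch)

    def forward(self, f_sem, f_det):
        return self.gc(self.proj_sem(f_sem) + self.proj_det(f_det))
```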
step 4, inputting the encoding feature map F_E into the decoder and performing edge feature refinement and an up-sampling operation to obtain the decoding feature map F_D;
The decoder consists of two modules, the edge decoupling module and the up-sampling module. The lightweight edge decoupling module comprises 3 parts, whose structure is shown in FIG. 6: a lightweight atrous spatial pyramid pooling (ASPP) block, a main body feature generator and an edge retainer. The decoding feature map is obtained as follows: the encoding feature map F_E is first input into the lightweight edge decoupling module for edge feature refinement, generating a fine feature map with refined edges; the fine feature map is then input into the decoder's up-sampling module and up-sampled back to the size of the original input remote sensing image; the restored fine feature map is the decoding feature map F_D output by the decoder;
The main body feature generator comprises two processes, flow field generation and feature warping. Flow field generation consists of a micro encoder-decoder structure with one downsampling and one up-sampling step plus a conventional convolution with a 3 × 3 kernel, and produces a flow field representation in which features at the center of a target object are prominent. Feature warping applies a deformation operation to the flow field features to obtain a salient body feature representation of the target object. The body feature generator is therefore responsible for generating a more consistent feature representation for pixels within the same object, and the extracted features are the body features of the target object;
the edge retainer comprises two steps. The first is a subtractor that performs an explicit subtraction between the receptive-field-expanded encoding feature map and the body feature map to obtain a coarse edge feature map. The second is an edge feature refiner that supplements edge features with a low-level feature map containing fine details: the low-level feature map from the encoder is fused into the generated coarse edge feature map through channel stacking, supplementing high-frequency information, after which a 1 × 1 conventional convolution reduces the dimensionality and outputs the refined edge feature map;
the input encoding feature map F_E first passes through the lightweight ASPP to generate a feature map F_aspp with multi-scale information and a larger receptive field, and the body generator then forms the target's body feature map F_body. F_body, F_aspp and F_E are input into the edge retainer: F_aspp and F_body undergo explicit subtraction to generate a preliminary edge feature map, which is then fused with F_E by channel stacking; a 1 × 1 conventional convolution reduces the dimensionality and outputs the refined edge feature map F_edge. Finally, F_body and F_edge are fused element-wise to obtain the fine output feature map F_final used for up-sampling recovery. The whole process can be expressed as:

F_aspp = f_dsaspp(F_E),
F_body = φ(F_aspp),
F_edge = ψ(F_body, F_aspp, F_E),
F_final = F_body + F_edge,

where f_dsaspp denotes the lightweight ASPP function, φ the body feature generating function, and ψ the edge-preserving function;
to obtain the final decoding feature map F_D, F_final is input into an up-sampling module containing a 1 × 1 conventional convolution and an up-sampling operation and is restored to the original input image size; the finally generated feature map is the decoding feature map F_D, with a scale of 512 × 512;
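The formulas above can be sketched as follows. The flow-field warping follows the general body/edge decoupling idea; the micro encoder-decoder inside the body generator is omitted for brevity, and all names and channel handling are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeDecouplingModule(nn.Module):
    """F_aspp = f_dsaspp(F_E); F_body = warp(F_aspp) via a learned flow field;
    F_edge = 1x1-reduce(concat(F_aspp - F_body, F_E)); F_final = body + edge."""
    def __init__(self, ch, dsaspp: nn.Module):
        super().__init__()
        self.dsaspp = dsaspp                        # channel-preserving lightweight ASPP
        self.flow = nn.Conv2d(ch, 2, 3, padding=1)  # per-pixel 2-D offsets
        self.reduce = nn.Conv2d(2 * ch, ch, 1)      # 1x1 conv after channel stacking

    def warp(self, x, flow):
        b, _, h, w = x.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=x.device),
                                torch.linspace(-1, 1, w, device=x.device),
                                indexing='ij')
        grid = torch.stack((xs, ys), -1).unsqueeze(0).expand(b, -1, -1, -1)
        norm = torch.tensor([w, h], dtype=x.dtype, device=x.device)
        offsets = 2 * flow.permute(0, 2, 3, 1) / norm  # pixel offsets -> grid units
        return F.grid_sample(x, grid + offsets, align_corners=False)

    def forward(self, f_e):
        f_aspp = self.dsaspp(f_e)
        f_body = self.warp(f_aspp, self.flow(f_aspp))          # body features
        f_edge = self.reduce(torch.cat([f_aspp - f_body, f_e], 1))
        return f_body, f_edge, f_body + f_edge                 # F_final = body + edge
```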
step 5, inputting the decoding characteristic graph into a classifier to perform pixel level classification prediction, outputting a segmentation result, and performing supervised training on the semantic segmentation network through a supervision mechanism;
the main body of the classifier is a SoftMax layer. After the decoding feature map F_D is processed by the SoftMax layer, pixel-level classification prediction is completed and the semantic segmentation result is obtained. Network training is supervised by a mechanism formed from the segmentation result and the real labels, so that the semantic segmentation network achieves its optimal segmentation performance;
the supervision mechanism does not supervise only the final segmentation result; rather, F_body, F_edge, F_final and F_E are supervised jointly. The mechanism is realized through a designed loss function; the total loss, denoted L, is:

L = λ1·L_body + λ2·L_final + λ3·L_G + λ4·L_edge, with L_edge = λ5·L_bce + λ6·L_ce,

where L_body, L_final, L_edge and L_G respectively denote the body feature loss, fine feature loss, edge feature loss and global coding loss. L_final and L_G adopt the cross entropy loss common in semantic segmentation tasks. L_body adopts a boundary relaxation loss, which during training relaxes the classification of boundary pixels in F_body and allows the segmentation network to predict multiple classes for them. L_edge is a comprehensive loss that acquires a boundary edge prior from the edge prediction part and comprises two terms: the binary cross entropy loss L_bce for boundary pixel classification and the cross entropy loss L_ce for the edge part of the scene. The hyperparameters λ1, λ2, λ3, λ4, λ5 and λ6 control the weighting among the losses; the first three default to 1, and the last three to 0.4, 20 and 1, respectively. Here ŷ denotes the real semantic label, b̂ the binary boundary mask generated from ŷ, b the boundary prediction result, and s_body, s_final and s_E the segmentation map predictions obtained from F_body, F_final and F_E, respectively;
step 6, training the semantic segmentation network built in step 2 with the training samples according to steps 3 to 5;
after the semantic segmentation network is built as described above, training samples are continuously input to train the network according to steps 3 to 5. Before training, the relevant training parameters, such as the network's input scale, input batch size and learning rate, need to be set.
Step 7, inputting a sample to be tested into the trained semantic segmentation network, outputting a final semantic segmentation result of the remote sensing image, and completing the test of the semantic segmentation network;
the following is a specific example experiment, which is not intended to limit the use of the method of the present invention, but is merely a better example for analysis.
The experiment used the Vaihingen dataset provided by ISPRS, which contains 3-channel IRRG images, DSM images and NDSM images: 16 remote sensing images of size 6000 × 6000 with corresponding labels. The corresponding visualization results are shown in FIG. 7. The semantic labels of the 6 target classes contained in the annotations are determined by RGB values, as shown in Table 1 below:
TABLE 1 Semantic annotation information (the original table is an image; the values below are the standard ISPRS color coding for the six classes)

Class                  RGB value
Impervious surfaces    (255, 255, 255)
Building               (0, 0, 255)
Low vegetation         (0, 255, 255)
Tree                   (0, 255, 0)
Car                    (255, 255, 0)
Clutter/background     (255, 0, 0)
In this embodiment, the dataset is preprocessed with the sliding-window cropping and data augmentation described in step 1, yielding multi-channel images of 512 × 512 × 3. Training and test samples are divided in a 4:1 ratio;
then, a semantic segmentation network of the method is built, and the relevant parameters are set before training: the network's input scale is 512 × 512, the input batch size is set to 10 (according to available video memory), the optimizer is SGD, the initial learning rate is 0.001, the minimum learning rate is 0.00001, the momentum is 0.9, and the weight decay coefficient is 0.0005.
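These hyperparameters translate directly into an optimizer setup; a sketch assuming `model` is the assembled network and a polynomial decay toward the stated floor (the decay policy itself is not specified in the text):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
# Decay the learning rate each iteration but never below 0.00001:
max_iters = 100000  # assumed training length
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda it: max((1 - it / max_iters) ** 0.9, 0.00001 / 0.001))
```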
In this embodiment, the selected semantic segmentation evaluation indexes are mean intersection over union (mIoU), mean pixel accuracy (mAcc), GFLOPs (floating-point computation), parameter count, and the segmentation inference time for a single image. Four semantic segmentation methods are selected for comparison in terms of both segmentation accuracy and efficiency: UNet, PSPNet, Fast-SCNN and Sem-FPN. mIoU and mAcc serve as the standards for measuring semantic segmentation accuracy: the higher they are, the closer the segmentation result is to the real labels and the higher the precision. GFLOPs, parameter count and single-image inference time serve as the standards for segmentation efficiency: the smaller they are, the higher the efficiency. The experimental results of the different semantic segmentation methods are shown in Table 2:
TABLE 2 Comparison of the proposed method with existing methods

Method            mIoU (%)   mAcc (%)   GFLOPs   Parameters (M)   Inference time (s)
UNet              86.19      91.16      203.04   29.06            0.067
PSPNet            86.40      92.19      178.48   48.98            0.066
Fast-SCNN         76.23      83.83      0.91     1.21             0.015
Sem-FPN           83.57      90.91      45.48    28.50            0.029
Proposed method   85.33      90.98      6.63     4.17             0.031
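For reference, the two accuracy metrics in Table 2 are conventionally computed from a per-class confusion matrix, e.g.:

```python
import numpy as np

def miou_macc(pred: np.ndarray, label: np.ndarray, num_classes: int):
    """Mean IoU and mean pixel accuracy from integer class maps
    (all values assumed to lie in [0, num_classes))."""
    pred, label = pred.ravel(), label.ravel()
    cm = np.bincount(num_classes * label + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(float)
    iou = tp / (cm.sum(0) + cm.sum(1) - tp + 1e-10)  # per-class IoU
    acc = tp / (cm.sum(1) + 1e-10)                   # per-class pixel accuracy
    return iou.mean(), acc.mean()
```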
As the results in Table 2 show, the method achieves 85.33% mIoU and 90.98% mAcc with 6.63 GFLOPs, 4.17M parameters, and a single-image segmentation inference time of 0.031 s. Fast-SCNN has the fewest parameters, the least floating-point computation and the shortest inference time, but its accuracy is far below the method's. Compared with Sem-FPN, the method is slightly slower in inference but higher in both mIoU and mAcc, while Sem-FPN's parameter count and GFLOPs far exceed the method's. UNet and PSPNet are classical semantic segmentation networks that slightly surpass the method in accuracy, but their parameter counts, GFLOPs and inference times are several times the method's. The proposed semantic segmentation network is therefore superior to the other networks in the overall trade-off between accuracy and efficiency; moreover, the parameter count, GFLOPs and inference speed verify that the invention is a lightweight semantic segmentation method for remote sensing images;
fig. 8 shows the visualized semantic segmentation results obtained after inputting a test sample. Compared with the results of Fast-SCNN and Sem-FPN, the method classifies pixels more accurately and effectively reduces segmentation errors caused by misclassification; it also handles edge details more precisely and comes closer to the ground truth of the semantic labels. Compared with UNet and PSPNet, although the overall segmentation accuracy is slightly lower, the method's segmentation at edge details is closer to the semantic labels.
It is to be understood that the present invention has been described with reference to certain embodiments, and that various changes in the features and embodiments, or equivalent substitutions may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (10)

1. A remote sensing image lightweight semantic segmentation method based on edge decoupling is characterized by comprising the steps of building, training and testing a semantic segmentation network, wherein the semantic segmentation network is a lightweight coding and decoding network with a double-branch structure, after training of the semantic segmentation network is completed based on a training sample, a remote sensing image to be tested is input into the semantic segmentation network, and a final remote sensing image semantic segmentation result is output;
the method comprises the following steps in sequence:
step 1, acquiring a remote sensing image data set, and preparing a training and testing sample;
step 2, building a lightweight coding and decoding semantic segmentation network with a double-branch structure;
step 3, inputting the training samples into an encoder and performing feature encoding through feature extraction to obtain an encoding feature map F_E;
step 4, inputting the encoding feature map F_E into a decoder and performing edge feature refinement and an up-sampling operation to obtain a decoding feature map F_D;
Step 5, inputting the decoding characteristic graph into a classifier to perform pixel level classification prediction, outputting a segmentation result, and performing supervised training on the semantic segmentation network through a supervision mechanism;
step 6, training the semantic segmentation network built in step 2 with the training samples according to steps 3 to 5;
step 7, inputting the sample to be tested into the trained semantic segmentation network, outputting the final remote sensing image semantic segmentation result, and completing the test of the semantic segmentation network.
2. The remote sensing image light-weight semantic segmentation method based on edge decoupling as claimed in claim 1, wherein step 2 builds a light-weight codec semantic segmentation network with a double-branch structure, and the network comprises an encoder, a decoder and a classifier;
the encoder is of a double-branch structure and comprises a global downsampling block, a lightweight double-branch sub-network and a global feature fusion module;
the decoder consists of a lightweight edge decoupling module and an up-sampling module;
the classifier is composed of a conventional convolutional layer and a SoftMax layer.
3. The remote sensing image lightweight semantic segmentation method based on edge decoupling as claimed in claim 1, characterized in that obtaining the encoding feature map F_E in step 3 comprises the following steps:
step 3.1, inputting the training sample into a global downsampling block of an encoder to obtain a low-level feature map;
step 3.2, inputting the low-level feature map into a lightweight double-branch subnetwork in the encoder to obtain a space detail feature map and an abstract semantic feature map;
step 3.3, the obtained spatial detail feature map and abstract semantic feature map undergo multi-level feature fusion through the global feature fusion block of the encoder, and the encoding feature map F_E is output.
4. The remote sensing image lightweight semantic segmentation method based on edge decoupling as claimed in claim 3, characterized in that the global downsampling block in step 3.1 consists of 3 parts: 1 conventional convolution, 1 Ghost bottleneck module, and 1 global context module;
after the input samples pass through the global downsampling block, a low-level feature map whose resolution is 1/4 of the original input is generated and used as the input of the subsequent process.
5. The remote sensing image lightweight semantic segmentation method based on edge decoupling as claimed in claim 3, characterized in that the lightweight dual-branch sub-network in step 3.2 comprises two branches, namely a trunk depth branch for obtaining abstract semantic features and a spatial preservation branch for obtaining spatial detail features, and the two branches share the low-level feature map output by the global downsampling block;
the trunk depth branch is constructed on the GhostNet feature extraction network and comprises two structures: one is the branch main body, consisting of 16 Ghost bottleneck modules, which performs 4 downsampling steps to extract deep features; the other is a lightweight feature pyramid comprising four parts (depthwise separable convolution, an up-sampling block, a lightweight atrous spatial pyramid pooling (ASPP) module, and element fusion), which takes the 4 deep feature maps of different scales formed by the main body as input and finally outputs abstract semantic features with an enlarged receptive field and multi-scale information;
the spatial preservation branch is formed by 3 depthwise separable convolutions; the input low-level feature map is downsampled once, and the output spatial detail feature map has a resolution of 1/2 of the input.
6. The remote sensing image lightweight semantic segmentation method based on edge decoupling as claimed in claim 3, characterized in that the global feature fusion module in step 3.3 comprises 3 parts: first, two parallel depthwise separable convolutions with 1 × 1 kernels; second, element fusion; third, 1 global context module;
the input abstract semantic features and spatial detail features are dimension-adjusted by the two parallel convolutions, a feature map rich in spatial detail and abstract semantic information is output through element fusion, and lightweight context modeling is finally performed by the global context module, so that the resulting encoding feature map F_E better fuses global information.
7. The remote sensing image lightweight semantic segmentation method based on edge decoupling as claimed in claim 1, characterized in that the decoding feature map F_D in step 4 is obtained as follows: the encoding feature map F_E is first input into the lightweight edge decoupling module of the decoder, where edge feature refinement produces a fine feature map with refined edges; the fine feature map is then input into the up-sampling module of the decoder and up-sampled back to the size of the original input remote sensing image; the restored fine feature map serves as the decoding feature map F_D output by the decoder.
8. The remote sensing image lightweight semantic segmentation method based on edge decoupling as claimed in claim 2, characterized in that the lightweight edge decoupling module consists of 3 parts, namely a lightweight atrous spatial pyramid pooling (ASPP) block, a main body feature generator and an edge retainer; the encoding features first pass through the lightweight ASPP to generate a feature map F_aspp with multi-scale information and a larger receptive field; the body generator then produces a more consistent feature representation for pixels within the same object, forming the body feature map F_body of the target object; F_body, F_aspp and F_E are input into the edge retainer, which outputs a refined edge feature map F_edge through an explicit subtraction operation, channel-stacking fusion and 1 × 1 conventional convolution dimensionality reduction; finally, the body feature map and the refined edge feature map are fused, and the refined output feature map used for up-sampling recovery, F_final, is output; the overall process can be expressed as:

F_aspp = f_dsaspp(F_E),
F_body = φ(F_aspp),
F_edge = ψ(F_body, F_aspp, F_E),
F_final = F_body + F_edge,

where f_dsaspp denotes the lightweight ASPP function, φ the body feature generating function, and ψ the edge-preserving function;
the up-sampling module comprises two steps, a 1 × 1 conventional convolution operation and an up-sampling operation; after the fine feature map F_final passes through this module, the restored feature map has the size of the original input remote sensing image, i.e., the decoder output feature map F_D.
9. The remote sensing image lightweight semantic segmentation method based on edge decoupling as claimed in claim 1, characterized in that, in the supervision mechanism of step 5, after the decoding feature map F_D completes pixel-level classification prediction, the output is the semantic segmentation result; the network is supervised and trained through a supervision mechanism formed by the semantic segmentation result and the real labels, so that the semantic segmentation network achieves its optimal segmentation performance.
10. The remote sensing image lightweight semantic segmentation method based on edge decoupling as claimed in claim 1, characterized in that the supervision mechanism in step 5 is an edge-based supervision mode implemented by a designed loss function; the total loss function, denoted L, is:

L = λ1·L_body + λ2·L_final + λ3·L_G + λ4·L_edge, with L_edge = λ5·L_bce + λ6·L_ce,

where L_body, L_final, L_edge and L_G respectively denote the body feature loss, fine feature loss, edge feature loss and global coding loss; the inputs of these 4 loss functions are the segmentation results formed, after up-sampling recovery and a SoftMax layer, from the body feature map, the refined output feature map, the refined edge feature map and the encoding feature map, respectively, together with the corresponding real labels;
the loss function L_edge is a comprehensive loss that acquires a boundary edge prior from the edge prediction part and comprises two terms: the binary cross entropy loss L_bce for boundary pixel classification and the cross entropy loss L_ce for the edge part of the scene; the hyperparameters λ1, λ2, λ3, λ4, λ5, λ6 control the weighting among the losses.
CN202110456921.9A 2021-04-27 2021-04-27 Remote sensing image lightweight semantic segmentation method based on edge decoupling Active CN113159051B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110456921.9A CN113159051B (en) 2021-04-27 2021-04-27 Remote sensing image lightweight semantic segmentation method based on edge decoupling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110456921.9A CN113159051B (en) 2021-04-27 2021-04-27 Remote sensing image lightweight semantic segmentation method based on edge decoupling

Publications (2)

Publication Number Publication Date
CN113159051A true CN113159051A (en) 2021-07-23
CN113159051B CN113159051B (en) 2022-11-25

Family

ID=76871278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110456921.9A Active CN113159051B (en) 2021-04-27 2021-04-27 Remote sensing image lightweight semantic segmentation method based on edge decoupling

Country Status (1)

Country Link
CN (1) CN113159051B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113658200A (en) * 2021-07-29 2021-11-16 东北大学 Edge perception image semantic segmentation method based on self-adaptive feature fusion
CN113706561A (en) * 2021-10-29 2021-11-26 华南理工大学 Image semantic segmentation method based on region separation
CN113706546A (en) * 2021-08-23 2021-11-26 浙江工业大学 Medical image segmentation method and device based on lightweight twin network
CN113762396A (en) * 2021-09-10 2021-12-07 西南科技大学 Two-dimensional image semantic segmentation method
CN114398979A (en) * 2022-01-13 2022-04-26 四川大学华西医院 Ultrasonic image thyroid nodule classification method based on feature decoupling
CN114426069A (en) * 2021-12-14 2022-05-03 哈尔滨理工大学 Indoor rescue vehicle based on real-time semantic segmentation and image semantic segmentation method
CN114463542A (en) * 2022-01-22 2022-05-10 仲恺农业工程学院 Orchard complex road segmentation method based on lightweight semantic segmentation algorithm
CN115147703A (en) * 2022-07-28 2022-10-04 广东小白龙环保科技有限公司 GinTrans network-based garbage segmentation method and system
CN115240041A (en) * 2022-07-13 2022-10-25 北京理工大学 Shale electron microscope scanning image crack extraction method based on deep learning segmentation network
CN115272681A (en) * 2022-09-22 2022-11-01 中国海洋大学 Ocean remote sensing image semantic segmentation method and system based on high-order feature class decoupling
CN116342884A (en) * 2023-03-28 2023-06-27 阿里云计算有限公司 Image segmentation and model training method and server
CN117078967A (en) * 2023-09-04 2023-11-17 石家庄铁道大学 Efficient and lightweight multi-scale pedestrian re-identification method
CN117475155A (en) * 2023-12-26 2024-01-30 厦门瑞为信息技术有限公司 Lightweight remote sensing image segmentation method based on semi-supervised learning

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7490073B1 (en) * 2004-12-21 2009-02-10 Zenprise, Inc. Systems and methods for encoding knowledge for automated management of software application deployments
CN104392209A (en) * 2014-11-07 2015-03-04 Changchun University of Science and Technology Evaluation model for image complexity of target and background
CN104574296A (en) * 2014-12-24 2015-04-29 Changchun University of Science and Technology Multi-wavelet fusion image processing method for polarization haze removal
WO2020232717A1 (en) * 2019-05-23 2020-11-26 Siemens AG Edge-side model processing method, edge computing device, and computer readable medium
CN110674866A (en) * 2019-09-23 2020-01-10 Lanzhou University of Technology Method for detecting X-ray breast lesion images by using transfer learning feature pyramid network
CN111127493A (en) * 2019-11-12 2020-05-08 China University of Mining and Technology Remote sensing image semantic segmentation method based on attention multi-scale feature fusion
CN111079649A (en) * 2019-12-17 2020-04-28 Xidian University Remote sensing image ground feature classification method based on lightweight semantic segmentation network
CN111797676A (en) * 2020-04-30 2020-10-20 Nanjing University of Science and Technology High-resolution remote sensing image target on-orbit lightweight rapid detection method
CN111666836A (en) * 2020-05-22 2020-09-15 Beijing University of Technology High-resolution remote sensing image target detection method of M-F-Y type lightweight convolutional neural network
CN112149547A (en) * 2020-09-17 2020-12-29 Nanjing University of Information Science and Technology Remote sensing image water body identification based on image pyramid guidance and pixel pair matching
CN112183360A (en) * 2020-09-29 2021-01-05 Shanghai Jiao Tong University Lightweight semantic segmentation method for high-resolution remote sensing image
CN112330681A (en) * 2020-11-06 2021-02-05 Beijing University of Technology Attention mechanism-based lightweight network real-time semantic segmentation method
CN112580654A (en) * 2020-12-25 2021-03-30 Southwest China Institute of Electronic Technology (the 10th Research Institute of China Electronics Technology Group Corporation) Semantic segmentation method for ground objects of remote sensing image

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113658200B (en) * 2021-07-29 2024-01-02 Northeastern University Edge perception image semantic segmentation method based on self-adaptive feature fusion
CN113658200A (en) * 2021-07-29 2021-11-16 Northeastern University Edge perception image semantic segmentation method based on self-adaptive feature fusion
CN113706546A (en) * 2021-08-23 2021-11-26 Zhejiang University of Technology Medical image segmentation method and device based on lightweight Siamese network
CN113706546B (en) * 2021-08-23 2024-03-19 Zhejiang University of Technology Medical image segmentation method and device based on lightweight Siamese network
CN113762396A (en) * 2021-09-10 2021-12-07 Southwest University of Science and Technology Two-dimensional image semantic segmentation method
CN113706561A (en) * 2021-10-29 2021-11-26 South China University of Technology Image semantic segmentation method based on region separation
CN114426069A (en) * 2021-12-14 2022-05-03 Harbin University of Science and Technology Indoor rescue vehicle based on real-time semantic segmentation and image semantic segmentation method
CN114426069B (en) * 2021-12-14 2023-08-25 Harbin University of Science and Technology Indoor rescue vehicle based on real-time semantic segmentation and image semantic segmentation method
CN114398979A (en) * 2022-01-13 2022-04-26 West China Hospital of Sichuan University Ultrasonic image thyroid nodule classification method based on feature decoupling
CN114463542A (en) * 2022-01-22 2022-05-10 Zhongkai University of Agriculture and Engineering Orchard complex road segmentation method based on lightweight semantic segmentation algorithm
CN115240041A (en) * 2022-07-13 2022-10-25 Beijing Institute of Technology Shale scanning electron microscope image crack extraction method based on deep learning segmentation network
CN115147703B (en) * 2022-07-28 2023-11-03 Guangdong Xiaobailong Environmental Protection Technology Co., Ltd. Garbage segmentation method and system based on GinTrans network
CN115147703A (en) * 2022-07-28 2022-10-04 Guangdong Xiaobailong Environmental Protection Technology Co., Ltd. Garbage segmentation method and system based on GinTrans network
CN115272681B (en) * 2022-09-22 2022-12-20 Ocean University of China Ocean remote sensing image semantic segmentation method and system based on high-order feature class decoupling
CN115272681A (en) * 2022-09-22 2022-11-01 Ocean University of China Ocean remote sensing image semantic segmentation method and system based on high-order feature class decoupling
CN116342884A (en) * 2023-03-28 2023-06-27 Alibaba Cloud Computing Co., Ltd. Image segmentation and model training method and server
CN116342884B (en) * 2023-03-28 2024-02-06 Alibaba Cloud Computing Co., Ltd. Image segmentation and model training method and server
CN117078967A (en) * 2023-09-04 2023-11-17 Shijiazhuang Tiedao University Efficient and lightweight multi-scale pedestrian re-identification method
CN117078967B (en) * 2023-09-04 2024-03-01 Shijiazhuang Tiedao University Efficient and lightweight multi-scale pedestrian re-identification method
CN117475155A (en) * 2023-12-26 2024-01-30 Xiamen Ruiwei Information Technology Co., Ltd. Lightweight remote sensing image segmentation method based on semi-supervised learning
CN117475155B (en) * 2023-12-26 2024-04-02 Xiamen Ruiwei Information Technology Co., Ltd. Lightweight remote sensing image segmentation method based on semi-supervised learning

Also Published As

Publication number Publication date
CN113159051B (en) 2022-11-25

Similar Documents

Publication Publication Date Title
CN113159051B (en) Remote sensing image lightweight semantic segmentation method based on edge decoupling
CN112183360B (en) Lightweight semantic segmentation method for high-resolution remote sensing image
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN114120102A (en) Boundary-optimized remote sensing image semantic segmentation method, device, equipment and medium
CN112446476A (en) Neural network model compression method, device, storage medium and chip
CN111414942A (en) Remote sensing image classification method based on active learning and convolutional neural network
CN113807210A (en) Remote sensing image semantic segmentation method based on pyramid segmentation attention module
CN109035267B (en) Image target matting method based on deep learning
CN111695467A (en) Spatial spectrum full convolution hyperspectral image classification method based on superpixel sample expansion
CN112381097A (en) Scene semantic segmentation method based on deep learning
CN110852369B (en) Hyperspectral image classification method combining 3D/2D convolutional network and adaptive spectrum unmixing
CN112560966B (en) Polarized SAR image classification method, medium and equipment based on scattering map convolution network
CN111368935B (en) SAR time-sensitive target sample amplification method based on generative adversarial network
CN111814685A (en) Hyperspectral image classification method based on double-branch convolution self-encoder
CN104700100A (en) Feature extraction method for high spatial resolution remote sensing big data
CN111178438A (en) ResNet 101-based weather type identification method
CN111639697B (en) Hyperspectral image classification method based on non-repeated sampling and prototype network
CN114821340A (en) Land use classification method and system
CN111738052B (en) Multi-feature fusion hyperspectral remote sensing ground object classification method based on deep learning
CN114359297A (en) Attention pyramid-based multi-resolution semantic segmentation method and device
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
Piat et al. Image classification with quantum pre-training and auto-encoders
CN104504391B (en) Hyperspectral image classification method based on sparse features and Markov random field
CN116935043A (en) Typical object remote sensing image generation method based on multi-task adversarial network
CN111242028A (en) Remote sensing image ground object segmentation method based on U-Net

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant