CN117152435A - Remote sensing semantic segmentation method based on U-Net3+ - Google Patents
- Publication number: CN117152435A (application number CN202311135160.2A)
- Authority: CN
- Country: China
- Prior art keywords: net3, segmentation, remote sensing, network model, network
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
Abstract
A remote sensing semantic segmentation method based on U-Net3+ relates to the field of remote sensing image processing and comprises the following steps: data acquisition and preprocessing; constructing a U-Net3+ segmentation network model and adding a multi-scale feature extraction module and an attention mechanism to the constructed model; constructing an improved mixed loss function and applying it to the constructed U-Net3+ segmentation network model; passing the preprocessed data to the constructed U-Net3+ segmentation network model for model training; performing semantic segmentation on remote sensing images with the trained U-Net3+ segmentation network model and verifying its segmentation effect; and post-processing the images. The invention improves the accuracy of remote sensing semantic segmentation, reduces the complexity of the network model, and lowers the computational cost, which facilitates subsequent deployment.
Description
Technical Field
The invention relates to the technical field of remote sensing image processing, and in particular to a remote sensing semantic segmentation method based on U-Net3+.
Background
Remote sensing is the science and technology of acquiring information about the earth's surface and atmosphere from satellites, aircraft, or other remote sensing equipment. These devices collect electromagnetic radiation data in different wavebands (e.g., visible, infrared, and microwave) and convert it into digital images or other data formats for studying, monitoring, measuring, and managing the natural and artificial features of the earth's surface. Remote sensing is widely used in fields such as industry, agriculture, forestry, and the military. In urban planning, it helps characterize land use, traffic flow, and other conditions, providing important references for planning. In resource exploration, it can discover new mineral resources, oil and gas fields, and the like, promoting effective development and use of resources. In land monitoring, it can be used to monitor and evaluate crop growth, forest coverage, and so on. In the military field, it is often used for intelligence acquisition, target positioning and identification, terrain analysis and mapping, and mission planning.
In recent years remote sensing technology has developed rapidly, but remote sensing image processing has lagged behind. With ever-growing data volumes and diverse image categories, intelligently and efficiently extracting valuable information is one of the key problems the remote sensing field urgently needs to solve. Accurate segmentation of remote sensing images enables basic mapping functions such as land-cover information extraction and environmental change detection. Traditional remote sensing image segmentation methods are affected by many factors, including image quality, illumination, and occlusion, and their segmentation accuracy is poor. With the rapid development of deep learning, in particular convolutional neural networks, semantic segmentation has made significant progress in remote sensing image processing.
The U-Net network achieves remarkable results in image segmentation tasks through its distinctive encoder-decoder structure: the encoder extracts features, the decoder performs pixel-level reconstruction, and skip connections combine the high-level features from the encoder with the low-level features in the decoder, so the network can exploit multi-level feature information and achieve more accurate segmentation. U-Net++ uses nested, dense long connections to capture and superimpose features from different levels, reducing the semantic gap between encoder and decoder. U-Net3+ uses full-scale skip connections to fuse large-scale, same-scale, and small-scale features of the encoder and decoder, obtaining rich low-level and high-level semantic features. However, because U-Net, U-Net++, and U-Net3+ all splice and fuse low-level and high-level semantic features, a large amount of redundant information is generated and the network cannot focus well on the segmentation target; at the same time, multi-level feature fusion occupies substantial computing resources, which hinders deployment of the algorithm.
Existing remote sensing semantic segmentation has the following shortcomings:
1) Low segmentation accuracy: remote sensing images have complex content and a large imaging range; textures are intricate, ground objects take variable geometric forms, different ground objects are distributed in a tangled manner, and their boundaries are easily confused. In addition, remote sensing images are rich in ground-object content with variable scales, large scale spans, and differences in color tendency, all of which make segmentation difficult.
2) Class imbalance: the pixel counts of different ground-object classes in a remote sensing image often differ greatly, so minority classes cannot be trained sufficiently, which harms recognition accuracy.
3) Heavy computation and high resource usage: when the U-Net3+ segmentation network is applied to remote sensing semantic segmentation, its computational load is large and it occupies considerable resources; meanwhile, although splicing feature maps of different scales makes full use of feature information, simple splicing also stacks useless information from every encoder level, causing information redundancy and preventing the network from focusing on the segmentation target.
Disclosure of Invention
To solve the problems of low segmentation accuracy, class imbalance, heavy computation, and high resource usage in existing remote sensing semantic segmentation, the invention provides a remote sensing semantic segmentation method based on U-Net3+.
The technical scheme adopted by the invention to solve these problems is as follows:
the invention discloses a remote sensing semantic segmentation method based on U-Net3+, which comprises the following steps:
step one, data acquisition and preprocessing;
step two, constructing a U-Net3+ segmentation network model, and adding a multi-scale feature extraction module and an attention mechanism to the constructed U-Net3+ segmentation network model;
step three, constructing an improved mixed loss function, and applying it to the constructed U-Net3+ segmentation network model;
step four, the preprocessed data is transmitted to the constructed U-Net3+ segmentation network model to carry out model training;
fifthly, performing semantic segmentation on the remote sensing image by using the trained U-Net3+ segmentation network model, and verifying the segmentation effect of the U-Net3+ segmentation network model;
and step six, image post-processing.
Further, in step one, the acquired data come from the remote sensing semantic segmentation dataset GID-5. The dataset comprises a number of pictures, which are divided into a training set, a verification set, and a test set in the proportion 25:1:4.
In step one, the data are preprocessed: each picture is first sliced into tiles, under-sized tiles are padded with background, and the experimental label images are converted to gray-scale images.
Further, the specific operation flow of step two is as follows:
S2.1, building the U-Net3+ segmentation network, with the network depth reduced from 5 levels to 4;
S2.2, constructing the multi-scale feature extraction module; it contains a multi-scale convolution attention module divided into three parts: the first part is a 5×5 depthwise convolution for capturing local feature information; the second part consists of multiple branches of different hybrid dilated convolutions for extracting multi-scale feature information; the third part is a 1×1 convolution responsible for mixing channels, after which the input features are multiplied element-wise with the convolved weights to obtain the required output;
S2.3, combining a residual module with the CBAM attention mechanism, and adding the resulting residual CBAM attention module to the feature-fusion stage of each network level, so that the network focuses on important information.
Further, in step three, the mixed loss function combines Log Cosh Dice Loss, a variant of the Dice loss, with the Focal loss.
The Dice loss is defined as:
L_Dice = 1 − 2|X∩Y| / (|X| + |Y|)
where X denotes the model's prediction for the target image and Y denotes the true label of the target image.
The variant Log Cosh Dice Loss is defined as:
L_log-cosh-Dice = log(cosh(L_Dice))
where cosh is defined as:
cosh(x) = (e^x + e^(−x)) / 2
The Focal loss is defined as:
L_Focal = −α_t (1 − p_t)^γ log(p_t)
where α_t is a balance factor used to balance the importance of positive and negative samples, and γ is set to 2;
p_t = p if y = 1, and p_t = 1 − p otherwise,
where p is the predicted probability that the sample belongs to class 1 (in the range 0 to 1) and y is the label.
The final mixed loss function is defined as the sum of the two terms:
L_mix = L_log-cosh-Dice + L_Focal
further, the specific operation flow of the fourth step is as follows: inputting the preprocessed picture in the first step into a constructed U-Net3+ segmentation network, updating the network parameter weight, and verifying by using the picture of a verification set to obtain a network segmentation effect, and continuously storing a better network model to obtain the optimal network model.
In the fifth step, the test set picture is input into the optimal network model to obtain a segmentation effect picture, and the segmentation effect picture is compared with the experimental label picture to obtain segmentation accuracy.
In the sixth step, the segmentation effect map obtained in the fifth step is a gray level map, the segmentation effect map obtained in the fifth step is spliced according to the position of the first slice, the background filled in the first step is removed, and the segmentation effect map is mapped into the original picture according to the positions of the gray level maps to obtain the final segmentation map.
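As a minimal sketch of the post-processing in step six (the 512-pixel tile size, zero-valued background padding, and row-major slice ordering are assumptions, since the claim does not fix them), the predicted tiles could be stitched back and the padding cropped like this:

```python
import numpy as np

def stitch_tiles(patches, grid, out_shape, tile=512):
    """Reassemble per-tile segmentation maps according to their slice
    position, then crop away the background padding added during
    preprocessing so the result matches the original picture size."""
    rows, cols = grid
    canvas = np.zeros((rows * tile, cols * tile), dtype=patches[0].dtype)
    for i, patch in enumerate(patches):
        r, c = divmod(i, cols)  # row-major slice order assumed
        canvas[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile] = patch
    return canvas[:out_shape[0], :out_shape[1]]
```

Mapping the cropped gray-scale result back onto the original picture then only requires the slice origin recorded during preprocessing.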
The beneficial effects of the invention are as follows:
the remote sensing semantic segmentation method based on U-Net3+ solves the problems that a plurality of current remote sensing semantic segmentation networks are too complex, the model calculation amount is large, the remote sensing image segmentation precision is low, the data category is unbalanced and the occupied resources are high. The invention mainly aims to allocate each pixel in the remote sensing image to different semantic categories so as to realize fine classification and segmentation of the surface features. The method aims at converting the complex and changeable remote sensing image into pixel-level semantic information so as to further understand and analyze various objects, landforms and environments on the ground surface. According to the invention, through the improvement of a segmentation network, a multi-scale feature extraction module is introduced to extract image features from multi-scale directions; simultaneously introducing a residual CBAM attention module to make the network focus on the target area; in addition, the invention introduces a new loss function to balance the category in the sample, overcomes the problem of unbalanced category and improves the segmentation precision of remote sensing semantic segmentation. According to the invention, the network model is subjected to light weight treatment, the network parameter calculation amount and the network complexity are relatively low, the occupied calculation resources are less, and the subsequent arrangement and implementation are facilitated.
Drawings
FIG. 1 is a flow chart of a remote sensing semantic segmentation method based on U-Net3+.
Fig. 2 is a schematic structural diagram of the constructed U-Net3+ segmentation network.
Fig. 3 is a schematic structural diagram of the multi-scale feature extraction module.
Fig. 4 is a diagram of the full-scale skip connections used in the network.
Fig. 5 is a schematic diagram of the structure of the residual CBAM attention module.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, the remote sensing semantic segmentation method based on U-Net3+ of the present invention accurately segments each ground-object class contained in a given remote sensing image according to its position, and mainly comprises the following steps:
step one, data acquisition and preprocessing;
step two, constructing a U-Net3+ segmentation network model, and adding a multi-scale feature extraction module and an attention mechanism to the constructed U-Net3+ segmentation network model;
step three, constructing an improved mixed loss function, and applying it to the constructed U-Net3+ segmentation network model;
step four, the preprocessed data is transmitted to the constructed U-Net3+ segmentation network model to carry out model training;
fifthly, performing semantic segmentation on the remote sensing image by using the trained U-Net3+ segmentation network model, and verifying the segmentation effect of the U-Net3+ segmentation network model;
and step six, image post-processing.
The specific operation flow of the remote sensing semantic segmentation method based on U-Net3+ of the invention is as follows:
step one, data acquisition and preprocessing;
s1.1: the invention uses the remote sensing semantic segmentation dataset GID-5 for algorithm training, the dataset GID-5 comprises 150 pictures, the sizes of the pictures are 6800 multiplied by 7200 (pixels), 125 pictures are selected manually as a training set, 5 pictures are taken as a verification set, and 20 pictures are taken as a test set.
S1.2: because of the limitation of the computer video memory, the complete picture cannot be input into a network, the picture is firstly sliced to be changed into a picture with the size of 512 multiplied by 512 (pixels), then the insufficient position in the picture is filled with the background, and meanwhile, the experimental label picture is converted into a gray scale picture.
Step two, constructing a U-Net3+ segmented network model, and adding a multi-scale feature extraction module and an attention mechanism into the constructed U-Net3+ segmented network model;
s2.1, a U-Net3+ split network is built, the network level is reduced from 5 layers to 4 layers, and the built U-Net3+ split network structure is shown in fig. 2.
As shown in fig. 2, 1 denotes the primary feature map, 2 the secondary feature map, 3 the tertiary feature map, 4 the quaternary feature map, 5, 7, and 9 intermediate feature maps, 6 the new tertiary feature map, 8 the new secondary feature map, and 10 the new primary feature map; 11, 12, 13, and 14 are the results of convolution operations on feature maps of different stages under deep supervision, 15 is the network's prediction result, and 16 is the input image.
The U-Net3+ segmentation network is divided into an encoding stage and a decoding stage and mainly comprises a multi-scale feature extraction module and a residual CBAM attention module; the latter is formed by connecting a residual module with the CBAM attention mechanism through a skip connection.
In the encoding (feature extraction) stage, a residual module first processes the input image 16 to obtain the primary feature map 1; the multi-scale feature extraction module then processes map 1 to obtain the secondary feature map 2; a residual module processes map 2 to obtain the tertiary feature map 3; and finally a residual module processes map 3 to obtain the quaternary feature map 4. Compared with the adjacent higher-level map, each lower-level feature map is halved in size and doubled in channel count.
In the decoding stage, starting from the tertiary feature map 3, the feature maps of the other levels are up-sampled or pooled to a common size, their channel information is fused, and the result is processed by a residual CBAM module to obtain a new feature map; repeating these operations yields, in turn, the new secondary feature map 8 and the new primary feature map 10.
Prediction results of different sizes are obtained from the feature maps of different levels by convolution. In the training stage these predictions are deeply supervised: each is up-sampled to the size of the input image before the loss is computed and the gradients are updated. In the test stage, the result produced from the primary feature map is used as the final prediction 15.
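The deep-supervision step brings each side prediction to the input resolution before computing the loss; a nearest-neighbour version can be sketched as below (the patent does not state the interpolation mode, so nearest-neighbour is an assumption):

```python
import numpy as np

def upsample_nearest(pred, factor):
    """Nearest-neighbour up-sampling of an (H, W) prediction map by an
    integer factor, bringing a deep-supervision head to input resolution."""
    return np.repeat(np.repeat(pred, factor, axis=0), factor, axis=1)
```

A head operating at 1/4 resolution, for example, would be up-sampled with factor 4 before its loss term is added.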
S2.2, the multi-scale feature extraction module is constructed. It contains a multi-scale convolution attention module divided into three parts, as shown in fig. 3: the first part is a 5×5 depthwise convolution for capturing local feature information; the second part consists of multiple branches of different hybrid dilated convolutions for extracting multi-scale feature information; the third part is a 1×1 convolution responsible for mixing channels, after which the input features are multiplied element-wise with the convolved weights to obtain the required output. Specifically, in the multi-scale feature extraction module the input features are first processed by a 1×1 convolution to reduce the channel count, then fed to the multi-scale convolution attention module to obtain the convolution attention; BN normalization and GELU activation are applied to the result, features are fully extracted by a 3×3 depthwise convolution, the channel count is expanded again, and finally the expanded feature matrix is added to the channel-expanded input features. The invention adopts this module in the encoding stage of the network, using several hybrid dilated convolutions to extract multi-scale feature information from the input image and make full use of context.
As shown in fig. 3, LFMSCA denotes the multi-scale feature extraction module and LMSCA the multi-scale convolution attention module, where (128, Din) denotes an input feature map with 128 channels; (128, 1×1, 32) a convolution with kernel size 1×1, 128 input channels, and 32 output channels; (d, 5×5) a depthwise convolution with kernel size 5×5; (32, 1×1, 256) a convolution with kernel size 1×1, 32 input channels, and 256 output channels; (128, 1×1, 256) a convolution with kernel size 1×1, 128 input channels, and 256 output channels; (256, Dout) an output feature map with 256 channels; and (3×3, r=1), (3×3, r=2), (3×3, r=3) dilated convolutions with kernel size 3×3 and dilation rates 1, 2, and 3.
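One reason for the three dilation rates in fig. 3 is that each branch sees a different context size. The effective kernel of a single 3×3 dilated convolution grows with the rate, which can be checked with a small helper (the formula k + (k−1)(r−1) is the standard receptive-field rule for dilated convolutions, not taken from the patent itself):

```python
def dilated_receptive_field(kernel, rate):
    """Effective kernel size of one dilated convolution:
    k + (k - 1) * (r - 1)."""
    return kernel + (kernel - 1) * (rate - 1)

# the three branches of the multi-scale convolution attention module
branch_fields = [dilated_receptive_field(3, r) for r in (1, 2, 3)]
```

So the three parallel branches cover 3×3, 5×5, and 7×7 contexts at the cost of a single 3×3 kernel each, which is what gives the module its multi-scale behaviour.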
S2.3, a residual module is combined with the CBAM attention mechanism, and the resulting residual CBAM attention module is added to the feature-fusion stage of each network level, so that the network focuses on important information, redundant information is effectively suppressed, and the expressive power of the network is improved. Adding the CBAM attention mechanism in the multi-scale feature-fusion stage makes the network concentrate on the regions to be segmented, improving segmentation accuracy. In addition, the invention reduces network complexity and parameter computation by reducing the number of network levels and the transition channels.
Specifically, as shown in fig. 4, the feature-fusion stage of each network level adopts full-scale skip connections, where X_En^1 denotes the first-layer encoder, X_En^2 the second-layer encoder, X_En^3 the third-layer encoder, and X_En^4 the fourth-layer encoder; Maxpooling(2), Maxpooling(4), and Maxpooling(8) denote max-pooling operations with down-sampling rates of 2, 4, and 8 respectively; and Conv denotes a convolution operation.
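Assuming channel-first (C, H, W) feature maps and power-of-two sizes, the full-scale resizing of fig. 4 (max-pool larger maps, up-sample smaller ones, then concatenate channels) might look like the sketch below; the Conv block of fig. 4 that unifies channel counts after concatenation is omitted:

```python
import numpy as np

def maxpool(f, s):
    """Max-pooling of a (C, H, W) map with down-sampling rate s."""
    c, h, w = f.shape
    return f.reshape(c, h // s, s, w // s, s).max(axis=(2, 4))

def upsample(f, s):
    """Nearest-neighbour up-sampling of a (C, H, W) map by factor s."""
    return np.repeat(np.repeat(f, s, axis=1), s, axis=2)

def full_scale_fuse(feats, target):
    """Bring every level's feature map to the spatial size of level
    `target`, then concatenate along the channel axis."""
    th = feats[target].shape[1]
    resized = []
    for f in feats:
        h = f.shape[1]
        if h > th:
            f = maxpool(f, h // th)
        elif h < th:
            f = upsample(f, th // h)
        resized.append(f)
    return np.concatenate(resized, axis=0)
```

With four levels whose sizes halve and channels double per level, fusing at level 3 yields a map holding the summed channel count of all levels at that level's resolution.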
As shown in fig. 5, the residual CBAM attention module is formed by connecting a residual module with the CBAM attention mechanism through a skip connection. First, channel-attention feature-map information is computed for the input feature map F and multiplied with the input feature map for adaptive feature correction; then spatial-attention feature-map information is computed and a further feature correction is performed to obtain F_S; finally, the channel information of the input feature map and the corrected feature map is fused through the skip connection to obtain the final output feature map F_out.
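The data flow of fig. 5 can be sketched as below. This is a deliberately simplified stand-in: full CBAM derives channel attention from a shared MLP over average- and max-pooled descriptors and spatial attention from a 7×7 convolution, both of which are replaced here by direct sigmoids, and fusing the skip connection by addition is likewise an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f):
    """Channel weights from pooled descriptors of a (C, H, W) map
    (the shared MLP of full CBAM is omitted)."""
    return sigmoid(f.mean(axis=(1, 2)) + f.max(axis=(1, 2)))[:, None, None]

def spatial_attention(f):
    """Spatial weights from channel-pooled maps (the 7x7 convolution
    of full CBAM is omitted)."""
    return sigmoid(f.mean(axis=0) + f.max(axis=0))[None, :, :]

def residual_cbam(f):
    """Residual CBAM flow: adaptive channel correction, then spatial
    correction giving F_S, then a skip connection fusing the input
    back in to give F_out."""
    f_c = f * channel_attention(f)       # channel-corrected features
    f_s = f_c * spatial_attention(f_c)   # corrected feature map F_S
    return f + f_s                       # output feature map F_out
```

Because both attention maps lie in (0, 1), the module rescales rather than replaces the input, and the skip connection preserves the original signal.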
Step three, constructing an improved mixed loss function, and applying it to the constructed U-Net3+ segmentation network model;
specifically, the training uses a mixed Loss function, which is formed by combining a variation Log Cosh Dice Loss of the conventional Dice Loss function Dice of image semantic segmentation and the Focal Loss function Focal Loss.
The Dice Loss function Dice Loss is defined as:

L_Dice = 1 - 2|X ∩ Y| / (|X| + |Y|)
wherein X represents the prediction result of the model on the target image, and Y represents the real label of the target image.
Said variant Log Cosh Dice Loss is defined as:
L_log-cosh-Dice = log(cosh(L_Dice))
wherein the cosh function is defined as:

cosh(x) = (e^x + e^(-x)) / 2
the Focal Loss function Focal Loss is defined as:
L_Focal = -α_t (1 - p_t)^γ log(p_t)
wherein α_t is a balancing factor, generally in the range [0, 1], used to balance the importance of positive and negative samples; γ is set to 2;
wherein p denotes the predicted probability that a sample belongs to class 1 (in the range 0-1) and y denotes the label; p_t equals p when y = 1 and 1 - p otherwise;
the final mixed loss function is defined as:

L = L_log-cosh-Dice + L_Focal
the invention designs the mixing loss function, and solves the problem of low segmentation precision caused by unbalanced category in the remote sensing image. The mixed loss function fuses the Dice loss function and the focus loss function, improves the unbalance problem of image samples, and improves the precision of remote sensing semantic segmentation.
Step four, the preprocessed data is transmitted to the constructed U-Net3+ segmentation network model to carry out model training;
specifically, the preprocessed training pictures from step one are input into the constructed U-Net3+ segmentation network and the network parameter weights are updated; the verification-set pictures are used for verification to obtain the network segmentation effect, and the better network model is continuously saved so as to obtain the optimal network model.
Step five, performing semantic segmentation on the remote sensing image by using the trained U-Net3+ segmentation network model, and verifying the segmentation effect of the U-Net3+ segmentation network model;
specifically, the test-set pictures divided in step one are input into the optimal network model saved in step four to obtain segmentation effect maps, and these are compared with the experimental label maps to obtain the segmentation accuracy.
Step six: post-processing of the image;
The segmentation effect maps obtained in step five are gray-scale maps; they are stitched according to the slice positions from step one, the background filled in step one is removed so that the image is restored to 6800 × 7200 pixels, and the final segmentation map is obtained by mapping the position of each class in the gray-scale map back into the original image.
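The slice-and-stitch post-processing can be sketched as follows, under stated assumptions: 256 × 256 tiles, zero-valued background fill, and row-major tile order (the patent does not specify the tile size or ordering).

```python
import numpy as np

def slice_with_pad(img, tile=256):
    """Pad img (H, W) with background zeros up to a multiple of `tile`,
    then cut it into row-major tiles."""
    h, w = img.shape
    H = -(-h // tile) * tile  # ceil to a multiple of the tile size
    W = -(-w // tile) * tile
    padded = np.zeros((H, W), dtype=img.dtype)
    padded[:h, :w] = img
    tiles = [padded[r:r + tile, c:c + tile]
             for r in range(0, H, tile) for c in range(0, W, tile)]
    return tiles, (H, W)

def stitch(tiles, padded_shape, orig_shape, tile=256):
    """Reassemble tiles by their slice positions, then crop away the fill."""
    H, W = padded_shape
    canvas = np.zeros(padded_shape, dtype=tiles[0].dtype)
    i = 0
    for r in range(0, H, tile):
        for c in range(0, W, tile):
            canvas[r:r + tile, c:c + tile] = tiles[i]
            i += 1
    h, w = orig_shape
    return canvas[:h, :w]

# Gray-scale class map at the image size stated in the patent (6800 x 7200).
gray = np.random.randint(0, 5, (6800, 7200), dtype=np.uint8)
tiles, pshape = slice_with_pad(gray)
restored = stitch(tiles, pshape, gray.shape)
```

Round-tripping through the tiler and stitcher reproduces the original class map exactly; in the patent's pipeline the tiles would instead be the per-slice network outputs.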
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations shall also be regarded as falling within the scope of protection of the present invention.
Claims (8)
1. The remote sensing semantic segmentation method based on U-Net3+ is characterized by comprising the following steps of:
step one, data acquisition and preprocessing;
step two, constructing a U-Net3+ segmented network model, and adding a multi-scale feature extraction module and an attention mechanism into the constructed U-Net3+ segmented network model;
step three, constructing an improved mixed loss function, and applying the improved mixed loss function to the constructed U-Net3+ split network model;
step four, the preprocessed data is transmitted to the constructed U-Net3+ segmentation network model to carry out model training;
step five, performing semantic segmentation on the remote sensing image by using the trained U-Net3+ segmentation network model, and verifying the segmentation effect of the U-Net3+ segmentation network model;
and step six, image post-processing.
2. The U-Net3+ based remote sensing semantic segmentation method according to claim 1, wherein in the first step, the acquired data is derived from the remote sensing semantic segmentation dataset GID-5, which comprises a plurality of pictures divided into a training set, a verification set and a test set in a number ratio of 25:1:4.
3. The remote sensing semantic segmentation method based on U-Net3+ according to claim 2, wherein in the first step, the data is preprocessed: the pictures are sliced, positions where a picture is insufficient are filled with background, and the experimental label maps are converted into gray-scale maps.
4. The remote sensing semantic segmentation method based on U-Net3+ according to claim 1, wherein the specific operation flow of the second step is as follows:
s2.1, a U-Net3+ split network is built, and the network level is reduced from 5 layers to 4 layers;
s2.2, constructing a multi-scale feature extraction module; the multi-scale feature extraction module comprises a multi-scale convolution attention module which is divided into three parts, wherein the first part is a 5×5 depth convolution for obtaining local feature information; the second part is different mixed cavity convolutions of multiple branches and is used for extracting multi-scale characteristic information; the third part is 1 multiplied by 1 convolution, which is responsible for mixing channels, and finally multiplying the input characteristics with the convolved weight element by element to obtain the required output;
s2.3, combining the residual error module with a CBAM attention mechanism, and adding the residual error CBAM attention module in the characteristic fusion stage of each layer of the network, so that the network focuses on important information.
5. The remote sensing semantic segmentation method based on U-Net3+ according to claim 1, wherein in the third step, the mixed loss function is formed by combining Log Cosh Dice Loss, a variant of the Dice Loss function, with the Focal Loss function;
the Dice Loss function Dice Loss is defined as:
wherein X represents the prediction result of the model on the target image, and Y represents the real label of the target image;
said variant Log Cosh Dice Loss is defined as:
L_log-cosh-Dice = log(cosh(L_Dice))
wherein the cosh function is defined as:

cosh(x) = (e^x + e^(-x)) / 2
the Focal Loss function Focal Loss is defined as:
L_Focal = -α_t (1 - p_t)^γ log(p_t)
wherein α_t is a balancing factor for balancing the importance of positive and negative samples; γ is set to 2;
wherein p denotes the predicted probability that a sample belongs to class 1 (in the range 0-1) and y denotes the label; p_t equals p when y = 1 and 1 - p otherwise;
the final mixed loss function is defined as:

L = L_log-cosh-Dice + L_Focal
6. The remote sensing semantic segmentation method based on U-Net3+ according to claim 2, wherein the specific operation flow of the fourth step is as follows: the pictures preprocessed in the first step are input into the constructed U-Net3+ segmentation network and the network parameter weights are updated; the verification-set pictures are used for verification to obtain the network segmentation effect, and the better network model is continuously saved so as to obtain the optimal network model.
7. The remote sensing semantic segmentation method based on U-Net3+ according to claim 6, wherein in the fifth step, the test set picture is input into an optimal network model to obtain a segmentation effect picture, and the segmentation effect picture is compared with an experimental label picture to obtain segmentation accuracy.
8. The remote sensing semantic segmentation method based on U-Net3+ according to claim 7, wherein in the sixth step, the segmentation effect maps obtained in the fifth step are gray-scale maps, which are stitched according to the slice positions from the first step, the background filled in the first step is removed, and the final segmentation map is obtained by mapping the position of each class in the gray-scale map back into the original picture.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311135160.2A CN117152435A (en) | 2023-09-05 | 2023-09-05 | Remote sensing semantic segmentation method based on U-Net3+ |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117152435A true CN117152435A (en) | 2023-12-01 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117893934A (en) * | 2024-03-15 | 2024-04-16 | 中国地震局地质研究所 | Improved UNet3+ network unmanned aerial vehicle image railway track line detection method and device |
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |