CN115761223A - Remote sensing image instance segmentation method by using data synthesis - Google Patents

Remote sensing image instance segmentation method by using data synthesis

Info

Publication number
CN115761223A
Authority
CN
China
Prior art keywords
image
remote sensing
data set
sensing image
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211288030.8A
Other languages
Chinese (zh)
Inventor
李鹏程
白文浩
周杨
邢帅
蓝朝桢
张衡
施群山
吕亮
胡校飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force filed Critical Information Engineering University of PLA Strategic Support Force
Priority to CN202211288030.8A priority Critical patent/CN115761223A/en
Publication of CN115761223A publication Critical patent/CN115761223A/en
Pending legal-status Critical Current

Abstract

The invention belongs to the field of remote sensing image recognition, and particularly relates to a remote sensing image instance segmentation method using data synthesis. The method improves the reliability of instance segmentation by optimizing both the constructed instance segmentation model and the data set used to train it. For the training data set, known image resources are used to expand the data set; random brightness and contrast adjustment of the image data is added, improving the applicability of the training set; and the random scaling settings are optimized to increase the proportion of image-reduction operations, so that both large and small targets are recognized. In the instance segmentation model, a Swin Transformer model is used as the backbone network, so that multi-scale information is fully acquired and segmentation accuracy is improved; cascaded prediction heads balance the stability and the accuracy of prediction; and the RPN network in the model is optimized in detail: the aspect-ratio coefficients of the candidate boxes are adjusted to suit the typical sensitive targets of remote sensing images, and the regularization loss weight is adjusted to enhance the model's resistance to overfitting.

Description

Remote sensing image instance segmentation method by using data synthesis
Technical Field
The invention belongs to the field of remote sensing image recognition, and particularly relates to a remote sensing image instance segmentation method using data synthesis.
Background
Instance segmentation of remote sensing image targets plays a significant role in fields such as intelligent transportation, environment monitoring and urban planning. In recent years, deep learning techniques have shown good accuracy and efficiency in remote sensing image target instance segmentation. The Mask R-CNN algorithm is a mainstream instance segmentation algorithm that generates high-quality predicted object class labels and pixel masks while keeping the model simple and flexible.
A number of models have been derived from Mask R-CNN, such as Mask Scoring R-CNN, BMask R-CNN and PANet, but these models are all based on convolutional neural networks; because a convolutional neural network is limited by the size of its convolution kernel and cannot acquire global information from low-level features, their segmentation capability still needs improvement. On the other hand, when instance segmentation samples are produced, edges must be collected manually with software such as Photoshop or Labelme, which consumes a large amount of human resources and makes labeling inefficient; as a result, existing sample sets for remote sensing image instance segmentation suffer from small data volume and a single target category.
Disclosure of Invention
The invention aims to provide a remote sensing image instance segmentation method using data synthesis, which solves the problems of the existing remote sensing image instance segmentation approaches: image information is not comprehensively extracted, the reliability of segmentation results is low, and the training sample set has a small data volume and a single target category.
In order to achieve this purpose, the invention provides a remote sensing image instance segmentation method using data synthesis, which comprises the following steps:
1) Constructing a training data set by utilizing the remote sensing image segmentation data set and the remote sensing image classification data set;
2) Constructing an instance segmentation model; the instance segmentation model adopts a Cascade Mask R-CNN model, which comprises a backbone network, an FPN network, an RPN network, an ROI Align module and cascaded prediction heads;
the backbone network is a Swin Transformer model and is used for obtaining image features at different scales; the FPN network performs up-sampling and feature concatenation on the different-scale features obtained by the backbone network to produce feature maps at multiple scales with multi-scale information fused; the RPN network generates candidate boxes for each scale of feature map, and the ROI Align module crops the feature map corresponding to each candidate box and scales it to the same size; the cascaded prediction heads perform cascaded prediction on the feature maps output by the ROI Align module according to different IoU thresholds to obtain the segmentation result;
3) Training the instance segmentation model through the training data set to obtain a trained instance segmentation model;
4) Inputting the remote sensing image to be segmented into the trained instance segmentation model to obtain the instance segmentation result of the remote sensing image.
The method adopts a Swin Transformer model as the backbone network, which fully acquires multi-scale information during instance segmentation and improves segmentation accuracy; each prediction head in the cascade performs a stricter prediction on the basis of the bounding-box result produced by the previous stage, balancing the stability and accuracy of the prediction result. Meanwhile, the method uses known image resources to expand the data set, enhancing its applicability and avoiding unreliable segmentation results caused by a large distribution gap between the training set and the remote sensing images to be segmented due to missing environmental information and overly sparse targets in the data set.
Further, the construction process of the training data set is as follows:
selecting a target image in the remote sensing image segmentation data set, selecting a background image from the remote sensing image classification data set, splicing the target image into the background image to obtain synthetic image data, and constructing a training data set by using the synthetic image data.
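As an illustration of the splicing step above, the following is a minimal NumPy sketch of pasting a masked target crop into a background image; the function name, the placement arguments and the toy sizes are hypothetical, not taken from the patent.

```python
import numpy as np

def paste_target(background, target, mask, top, left):
    """Paste a masked target crop onto a background image.

    background: (H, W, 3) uint8 array, the classification-dataset scene.
    target:     (h, w, 3) uint8 array, the crop from the segmentation dataset.
    mask:       (h, w) bool array, True where the target's pixels are.
    Returns the composite image and the full-size instance mask, which
    becomes the ground-truth annotation for the synthetic sample.
    """
    h, w = mask.shape
    out = background.copy()
    region = out[top:top + h, left:left + w]
    region[mask] = target[mask]            # copy only the target's pixels
    full_mask = np.zeros(background.shape[:2], dtype=bool)
    full_mask[top:top + h, left:left + w] = mask
    return out, full_mask

# Toy example: an 8x8 gray background and a 3x3 white square target.
bg = np.full((8, 8, 3), 100, dtype=np.uint8)
tgt = np.full((3, 3, 3), 255, dtype=np.uint8)
m = np.ones((3, 3), dtype=bool)
img, gt = paste_target(bg, tgt, m, 2, 2)
```

Because the instance mask of the pasted target is known exactly, the composite sample needs no manual edge labeling, which is the labor saving the method relies on.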
Further, to avoid situations where the selected background image already contains many target objects, in which case targets from the segmentation data set easily occlude them during image stitching, leaving the background image's original targets incomplete and polluting the data set, a category image containing fewer than two target objects in the remote sensing image classification data set is selected as the background image for stitching.
Furthermore, because exposure conditions of real remote sensing images vary widely, and in order to adapt to situations such as overexposure and color cast, the method also includes performing random brightness and contrast adjustment on the image data in the training data set.
Further, the method also includes randomly scaling the image data in the training data set, with the proportion of image-reduction operations among the random scaling operations increased.
Compared with a traditional convolutional neural network, a detection model using Swin Transformer as the backbone network has a significantly enlarged receptive field, but the image is partitioned into blocks of a fixed pixel size before input, so targets of different sizes are still detected unevenly, and the ability to segment details smaller than the image block size needs improvement. To further enhance the algorithm's ability to identify small targets, the proportion of image reduction is increased while detection of large targets is preserved. After the scaling is modified, more small-size targets are included in the training process, improving the network's ability to detect small targets within the same number of training rounds.
Further, the Swin Transformer network includes a window attention module and a shift window attention module, where the window attention module is used to perform multi-head self-attention calculation in a set window, and the shift window attention module is used to shift the position of the set window.
Furthermore, in order to make the aspect ratios of the candidate boxes better suit the typical sensitive targets of remote sensing images, the aspect-ratio coefficients of the RPN network's candidate boxes are {0.8, 1.0, 1.25}.
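The adjusted ratio set can be illustrated by generating the base anchors for one feature-map cell. This is a sketch only: the base size of 32 pixels and the area-preserving parameterization are assumptions, since the text specifies only the ratio set {0.8, 1.0, 1.25}.

```python
import numpy as np

def base_anchors(size=32.0, ratios=(0.8, 1.0, 1.25)):
    """Base anchors (x1, y1, x2, y2) centered at the origin.

    The anchor area is held constant at size^2 while the aspect ratio
    r = h / w varies, the usual RPN parameterization.
    """
    anchors = []
    for r in ratios:
        w = size / np.sqrt(r)
        h = size * np.sqrt(r)
        anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)

A = base_anchors()
areas = (A[:, 2] - A[:, 0]) * (A[:, 3] - A[:, 1])
ratios = (A[:, 3] - A[:, 1]) / (A[:, 2] - A[:, 0])
```

The narrower spread {0.8, 1.0, 1.25} keeps all anchors near-square, matching targets such as aircraft, instead of the elongated {0.5, 1.0, 2.0} shapes used for natural scenes.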
Further, to enhance the model's resistance to overfitting, the Smooth L1 regularization loss weight of the RPN network is set to 5 when training the network model.
Drawings
FIG. 1a shows the remote sensing image classification data set in an embodiment of the remote sensing image instance segmentation method using data synthesis according to the present invention;
FIG. 1b shows the remote sensing image segmentation data set in an embodiment of the remote sensing image instance segmentation method using data synthesis according to the present invention;
FIG. 1c shows the synthetic image data set in an embodiment of the remote sensing image instance segmentation method using data synthesis according to the present invention;
FIG. 2 is a schematic diagram of random scaling in an embodiment of the remote sensing image instance segmentation method using data synthesis according to the present invention;
FIG. 3 is a diagram of the instance segmentation model structure in an embodiment of the remote sensing image instance segmentation method using data synthesis according to the present invention;
FIG. 4 is a diagram of the Swin Transformer model structure in an embodiment of the remote sensing image instance segmentation method using data synthesis according to the present invention;
FIG. 5 is a schematic diagram of the window and the moving window in an embodiment of the remote sensing image instance segmentation method using data synthesis according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
Embodiment of the remote sensing image instance segmentation method using data synthesis
This embodiment provides a remote sensing image instance segmentation method using data synthesis, which specifically comprises the following steps:
1) Constructing a training data set using the remote sensing image segmentation data set and the remote sensing image classification data set.
The data volume of remote sensing image small-target instance segmentation data sets used for training instance segmentation models is usually small, and the image data in such data sets often lack environmental information and contain overly sparse targets. As a result, the distribution of targets differs greatly between the training set and the test set actually used in instance segmentation, so when remote sensing images are segmented with a model trained on such a set, reliability is relatively low.
To expand the training data set and enable the trained segmentation model to adapt to more real remote sensing image conditions, in this embodiment the training data set is constructed as follows: a target image is selected from the remote sensing image segmentation data set, a background image is selected from the remote sensing image classification data set, and the target image is pasted into the background image to obtain synthetic image data, from which the training data set is constructed. That is, the target portion of a remote sensing image is pasted into an image of a specific scene type to obtain a composite image containing the target. It should be noted that, to avoid the situation where the selected background image already contains many target objects, whose occlusion during image stitching would leave the background image's original targets incomplete and pollute the data set, an image in the remote sensing image classification data set that does not contain the target is selected as the background image for stitching. In this embodiment, the final composite image is obtained by pasting all targets of one target image onto the original background image; in other embodiments, the number of targets may be chosen adaptively according to actual conditions such as the quality of the images in the remote sensing image classification data set.
The process of constructing the training data set is described below with reference to figs. 1a to 1c, taking instance segmentation of airplane targets as an example:
as shown in fig. 1a, the image size of AID30 remote sensing image classification data set is 600 × 600 pixels and the image quality is high, so the image therein is suitable for being used as a background image; as shown in fig. 1b, since the remote sensing image segmentation data set eosrssd includes a large amount of aircraft data with image size less than 600 × 600 pixels and lacking of surrounding environment, the aircraft target image is conveniently segmented and the target image is smaller than the background image, and therefore, the aircraft image part in the eosrssd data set is selected as the target image.
As shown in fig. 1c, the airplane image portion of the EORSSD data set is used as target images and pasted directly into the Center, DenseResidential, Industrial, RailwayStation and Farmland category images of the remote sensing image classification data set AID30 to obtain synthetic image data, from which the training data set is constructed.
The Airport category in the AID30 data set is also relevant to airplane targets, but it is not selected as background. Airport-category images in this data set already contain a large number of airplanes; if EORSSD images were pasted onto such a background, the original airplane image portions would be occluded, leaving them incomplete and hard to identify during instance segmentation, thereby polluting the data set. Category images likely to contain many target objects are therefore not selected as background images for stitching.
Airports are usually located on the outskirts of cities and have relatively distinctive terminal buildings. The invention therefore exploits the fact that Center resembles RailwayStation in the structural characteristics of station buildings, and that DenseResidential, Industrial and Farmland match the suburban surroundings, selecting category images close to the scenes where airplane targets appear as background images for stitching. The selection of background images need not be limited to categories directly related to airplane targets, and composite images that fit real airplane remote sensing imagery can still be obtained, balancing the quantity and quality of the synthetic image data and yielding a training data set that is both larger and closer to real remote sensing conditions.
Considering that exposure conditions of real remote sensing images vary widely, and in order to adapt to situations such as overexposure and color cast, this embodiment adds random brightness and contrast adjustment on top of the random cropping, flipping and scaling of the original image data in the training data set, improving the applicability of the training data set.
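A minimal sketch of such an augmentation step is given below; the jitter ranges are assumptions, since the embodiment only states that random brightness and contrast adjustment is added.

```python
import numpy as np

def random_brightness_contrast(img, rng, b_range=0.2, c_range=0.2):
    """Randomly jitter brightness and contrast of a uint8 image.

    Contrast: multiply by a random factor alpha around 1.
    Brightness: add a random offset beta in pixel units.
    The ranges b_range and c_range are illustrative assumptions.
    """
    alpha = 1.0 + rng.uniform(-c_range, c_range)    # contrast factor
    beta = rng.uniform(-b_range, b_range) * 255.0   # brightness offset
    out = img.astype(np.float32) * alpha + beta
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
img = np.full((4, 4, 3), 128, dtype=np.uint8)
aug = random_brightness_contrast(img, rng)
```

Applied on the fly during training, this exposes the model to simulated overexposure and color-cast conditions without enlarging the stored data set.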
In addition, compared with a conventional convolutional neural network, a detection model using Swin Transformer as the backbone network has a significantly enlarged receptive field, but the image is partitioned into blocks of a fixed pixel size (usually 4 pixels by default) before input, which still causes targets of different sizes to be detected unevenly. To further enhance the algorithm's ability to identify small targets, the random scaling method in the data enhancement stage is optimized: the proportion of image-reduction operations is increased without sacrificing detection of large targets, so that after the scaling is modified, more small-size targets are included in the training process and the network's small-target detection ability improves within the same number of training rounds. In practice, reducing an image shrinks the targets it contains, so more small-size target images become training data; that is, images containing small-size targets are generated by image reduction, which improves the model's ability to identify small-size targets. When the images in the training data set are randomly scaled, the proportion of reduced images is increased, i.e., more images in the data set undergo size reduction, providing more small-size targets for recognition. The scaling optimization is shown in fig. 2, where origin denotes the candidate scales of the prior art, resize denotes the candidate scales of this embodiment, the horizontal axis is the pixel size of the image's short side, and the vertical axis indicates the presence (1) or absence (0) of that scale factor.
Thus, on the basis of thinning out the original scale factors, the scaling optimization adds small-size scale factors for large-size images; that is, large-size images are reduced to obtain images containing small-size targets for model training.
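The effect of enlarging the reduction side of the scale pool can be sketched as follows; the concrete short-side sizes are assumptions, since the embodiment's actual values appear only in fig. 2, which is not reproduced here.

```python
import random

# Candidate short-side sizes for multi-scale training. The point of the
# optimization is that the pool below the default size is enlarged, so a
# random draw shrinks the image more often and exposes the model to more
# small targets per epoch. The specific values below are illustrative.
BASE_SCALES = [640, 672, 704, 736, 768, 800]   # typical default pool
EXTRA_SMALL = [480, 512, 544, 576, 608]        # added reduction scales

def sample_short_side(rng):
    """Draw the target short-side size for one training image."""
    return rng.choice(BASE_SCALES + EXTRA_SMALL)

rng = random.Random(0)
draws = [sample_short_side(rng) for _ in range(1000)]
small_frac = sum(d < 640 for d in draws) / len(draws)
```

With the enlarged pool, roughly 45% of draws now reduce the image below the smallest default size, versus 0% before, which is the mechanism behind the improved small-target recall.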
2) Constructing the instance segmentation model.
Referring to the instance segmentation model structure of fig. 3, the instance segmentation model adopts a Cascade Mask R-CNN model, which comprises a backbone network, an FPN network, an RPN network, an ROI Align module and cascaded prediction heads.
The backbone network is a Swin Transformer model, used to obtain image features at different scales. For the instance segmentation task on remote sensing images, a traditional convolutional neural network acquires global information only gradually through operations such as pooling, and cannot comprehensively acquire image features from low-level dimensions. The structure of the Swin Transformer model is shown in fig. 4; it comprises a window attention module and a moving window attention module, concretely built from multilayer perceptrons, layer normalization, attention modules and GELU activation functions connected by residuals. The multi-head self-attention mechanism can effectively combine information from multiple dimensions within the same layer from the very start of feature extraction, which benefits the detection of sensitive targets in complex and varied image scenes. Swin Transformer uses multi-head self-attention as its basic structure and borrows the patch-embedding idea of ViT: the image is partitioned into 7 × 7-pixel blocks and fed through the window attention module and the shifted window attention module to obtain deep features. The multi-head self-attention is computed as follows:
$$A_i=\mathrm{softmax}\!\left(\frac{Q_iK_i^{\mathrm{T}}}{\sqrt{d}}\right)V_i,\qquad Z=\mathrm{Concat}(A_1,A_2,\ldots,A_h)\,W$$
q, K and V are features obtained after image block embedding coding, K and V are keys and value vectors, d represents the dimension of the features, W is a linear coefficient matrix, and finally the multi-head self-attention mechanism obtains the deep-level features Z by combining information of different independent feature spaces.
The window attention module differs from a general attention module in that multi-head self-attention is computed only within a set window; since the complexity then grows linearly with image size, the window attention module reduces the amount of computation and frees computing capacity for finer prediction tasks. However, the window restriction confines self-attention to non-overlapping local windows, limiting the receptive field, so a moving window attention module is added after the window attention module. The moving window attention module shifts the position of the set window so that different regions enter the attention computation, and uses a dedicated mask mechanism to improve the efficiency of cross-window computation; the window and the moving window are shown schematically in fig. 5.
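The window partition and shift can be sketched in plain NumPy; the 4 × 4 feature map and window size 2 are toy values (Swin uses 7 × 7 windows shifted by half the window size), and the mask mechanism for the wrapped-around windows is omitted here.

```python
import numpy as np

def windows(x, ws):
    """Partition an (H, W) map into non-overlapping ws x ws windows."""
    H, W = x.shape
    return (x.reshape(H // ws, ws, W // ws, ws)
             .transpose(0, 2, 1, 3)
             .reshape(-1, ws, ws))

fmap = np.arange(16).reshape(4, 4)   # toy 4x4 feature map
shift = 1                            # window_size // 2

# The "moving window" step: cyclically roll the map so that the next
# round of windowed attention mixes positions that previously sat on
# opposite sides of a window boundary.
shifted = np.roll(fmap, (-shift, -shift), axis=(0, 1))

w0 = windows(fmap, 2)      # windows before the shift
w1 = windows(shifted, 2)   # windows after: straddle the old boundaries
```

After the roll, a single window of `shifted` contains pixels from up to four of the original windows, which is how cross-window connections arise without ever computing attention over overlapping windows.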
Referring to fig. 3, the FPN network performs up-sampling and feature concatenation on the different-scale features obtained by the backbone network to produce feature maps at multiple scales with multi-scale information fused; the RPN network generates candidate boxes for each scale of feature map. In the prior art, the initial candidate boxes of RPN networks designed for natural scenes usually use aspect ratios {0.5, 1.0, 2.0}; in this embodiment, according to the proportions of typical sensitive targets in remote sensing images, the aspect-ratio coefficients of the initial candidate boxes are adjusted to {0.8, 1.0, 1.25}, improving the accuracy of first recognition. Furthermore, to enhance the model's resistance to overfitting, the Smooth L1 regularization loss weight of the RPN network during training is increased from 1 to 5, i.e., model overfitting is alleviated by strengthening the penalty as model complexity increases. The Smooth L1 function is given by the following formula, where x is the difference between the target value and the predicted value:
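The weighted regression loss can be sketched as follows; the reduction by mean and the beta parameterization are assumptions, with only the weight of 5 taken from the text.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1: 0.5*x^2/beta where |x| < beta, else |x| - 0.5*beta."""
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax * ax / beta, ax - 0.5 * beta)

def rpn_box_loss(pred, target, weight=5.0):
    """Box regression loss with the weight raised from 1 to 5, so box
    errors are penalized more strongly during RPN training."""
    return weight * smooth_l1(pred - target).mean()

pred = np.array([0.5, 2.0, -1.5])
target = np.zeros(3)
loss = rpn_box_loss(pred, target)
```

The quadratic region keeps gradients small for near-correct boxes while the linear region caps the influence of outliers, and the factor of 5 simply rescales the whole term relative to the classification loss.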
$$\mathrm{Smooth}_{L_1}(x)=\begin{cases}0.5x^2, & |x|<1\\ |x|-0.5, & \text{otherwise}\end{cases}$$
The ROI Align module crops the feature map corresponding to each candidate box and scales it to the same size; the cropped feature maps are finally fed into the prediction heads for classification, bounding-box regression and mask prediction to obtain the final instance segmentation result.
The cascaded prediction heads perform cascaded prediction on the feature maps output by the ROI Align module according to different IoU thresholds. When the IoU (Intersection-over-Union) threshold of a prediction head is small, more background is admitted among the positive samples, causing false detections; when the IoU threshold is high, false detections decrease, but the number of positive samples is small, so the risk of overfitting is greater. Therefore, prediction heads with different IoU thresholds are cascaded: each head performs a further, stricter prediction on the basis of the bounding-box result produced by the previous stage, balancing the stability and accuracy of the prediction result.
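The cascaded relabeling can be sketched as follows; the thresholds 0.5/0.6/0.7 are the usual Cascade R-CNN defaults, assumed here because the text does not list the embodiment's values.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter)

# Each stage relabels the (refined) proposals with a stricter IoU
# threshold than the previous one, so later heads train on
# progressively higher-quality positives.
STAGE_THRESHOLDS = [0.5, 0.6, 0.7]   # assumed defaults

def positive_at_stage(proposal, gt, stage):
    return iou(proposal, gt) >= STAGE_THRESHOLDS[stage]

gt = (0, 0, 10, 10)
prop = (2, 0, 12, 10)   # IoU with gt is 80/120, about 0.667
flags = [positive_at_stage(prop, gt, s) for s in range(3)]
```

A proposal of moderate quality thus counts as positive for the early heads but not for the strictest one, which is how the cascade trades off recall in early stages against precision in later ones.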
3) Training the instance segmentation model on the training data set to obtain the trained instance segmentation model. For the instance segmentation operation, the remote sensing image to be segmented is input into the trained instance segmentation model to obtain its instance segmentation result.
Comparative experiment for the remote sensing image instance segmentation method using data synthesis
To compare the performance of different models, a model comparison experiment was carried out between the classical Mask R-CNN and Cascade Mask R-CNN models and the data-synthesis remote sensing image instance segmentation method proposed by the invention (Cascade Mask R-CNN Adjusted), all with Swin Transformer-Small as the backbone network and all trained for 40 rounds. The AP (Average Precision) results on the different test sets (Bbox test set, Segm test set), without any additional data, are shown in Table 1; the improvement ratio is the relative AP improvement of the proposed method over each classical model.
Table 1 comparison of model test sets
(Table 1 is reproduced as an image in the original publication; its numerical AP values are not recoverable from this text.)
Because the distribution of data such as target background and size differs between the training set and the test set, the detection indices of all models are somewhat poor; however, thanks to its cascade structure, Cascade Mask R-CNN achieves higher target detection precision and a higher AP value, and the modified model's enhanced resistance to overfitting and improved small-target detection further raise the detection results over the original method by 13.19% on bounding boxes and 8.88% on pixel-level masks.
The invention has the following characteristics:
1) In the stage of obtaining the training data set, the known image resources are utilized to expand the data set, so that the applicability of the data set is enhanced, and the problem that the segmentation result is unreliable due to the fact that the training set and the remote sensing image to be segmented have large distribution difference due to lack of environmental information and over sparse target in the data set is solved.
2) In the data enhancement stage of the training data set, compared with the prior art, random brightness and contrast adjustment of the image data is added, improving the applicability of the training data set, and the image-reduction proportion setting is optimized, so that both large and small targets are recognized.
3) In the stage of constructing the instance segmentation model, a Swin Transformer model is used as a backbone network, so that multi-scale information can be fully acquired during instance segmentation, and the segmentation accuracy is improved; and each prediction head of the cascade prediction carries out prediction with higher requirements on the basis of the boundary box result generated by the previous-stage prediction, and the stability and the accuracy of the prediction result are considered.
4) The RPN network in the instance segmentation model is optimized in detail: the aspect-ratio coefficients of the candidate boxes are adjusted to suit the typical sensitive targets of remote sensing images, and the regularization loss weight is adjusted to enhance the model's resistance to overfitting.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (8)

1. A remote sensing image instance segmentation method using data synthesis, characterized by comprising the following steps:
1) Constructing a training data set by utilizing the remote sensing image segmentation data set and the remote sensing image classification data set;
2) Constructing an instance segmentation model; the instance segmentation model adopts a Cascade Mask R-CNN model, which comprises a backbone network, an FPN network, an RPN network, an ROI Align module and cascaded prediction heads;
the backbone network is a Swin Transformer model and is used for obtaining image features at different scales; the FPN network performs up-sampling and feature concatenation on the different-scale features obtained by the backbone network to produce feature maps at multiple scales with multi-scale information fused; the RPN network generates candidate boxes for each scale of feature map, and the ROI Align module crops the feature map corresponding to each candidate box and scales it to the same size; the cascaded prediction heads perform cascaded prediction on the feature maps output by the ROI Align module according to different IoU thresholds to obtain the segmentation result;
3) Training the instance segmentation model through the training data set to obtain a trained instance segmentation model;
4) Inputting the remote sensing image to be segmented into the trained instance segmentation model to obtain an instance segmentation result of the remote sensing image.
2. The method for segmenting remote sensing image instances by using data synthesis according to claim 1, wherein the training data set is constructed by the following process:
selecting a target image in the remote sensing image segmentation data set, selecting a background image from the remote sensing image classification data set, splicing the target image into the background image to obtain synthetic image data, and constructing a training data set by using the synthetic image data.
3. The remote sensing image instance segmentation method using data synthesis according to claim 2, characterized in that a category image containing fewer than two target objects in the remote sensing image classification data set is selected as the background image for stitching.
4. A method as claimed in claim 1 or 2, further comprising performing random brightness and contrast adjustment on the image data in the training data set.
5. A method as claimed in claim 1 or 2, further comprising randomly scaling the image data in the training data set, wherein the proportion of image-reduction operations among the random scaling operations is increased.
6. The method for remote sensing image instance segmentation by data synthesis according to any one of claims 1 to 3, wherein the Swin Transformer network includes a window attention module and a shifted-window attention module; the window attention module performs multi-head self-attention calculation within a set window, and the shifted-window attention module shifts the position of the set window.
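The window partition and cyclic shift of claim 6 can be sketched in numpy as follows; this mirrors the standard Swin Transformer operations (window size `ws`, shift of `ws // 2`) and is not code from the patent.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows,
    returning an array of shape (num_windows, ws, ws, C)."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws, ws, C)

def shift_windows(x, ws):
    """Cyclically roll the feature map by ws // 2 in both spatial axes, so the
    next attention block sees windows straddling the previous partition lines."""
    return np.roll(x, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))
```

Multi-head self-attention is then computed independently inside each window, which keeps the cost linear in image size while the shift lets information cross window boundaries.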
7. The method of claim 6, wherein the candidate-box ratio coefficients of the RPN network are set to {0.8, 1.0, 1.25}.
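Assuming the {0.8, 1.0, 1.25} coefficients of claim 7 are anchor aspect ratios (an interpretation, since the claim does not say), equal-area anchors could be generated as:

```python
import math

RPN_ASPECT_RATIOS = [0.8, 1.0, 1.25]  # coefficients from claim 7

def make_anchors(base_size, ratios=RPN_ASPECT_RATIOS):
    """Generate (w, h) anchor shapes of equal area base_size**2 with h/w = ratio."""
    anchors = []
    for r in ratios:
        w = base_size / math.sqrt(r)
        h = base_size * math.sqrt(r)
        anchors.append((w, h))
    return anchors
```

Note 0.8 and 1.25 are reciprocals, giving symmetric near-square anchors, which matches the abstract's emphasis on typical remote-sensing targets.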
8. The method of claim 6, wherein, when training the network model, the Smooth L1 regularization loss weight of the RPN network is set to 5.
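A sketch of the weighted Smooth L1 regression loss of claim 8; the per-element form with `beta = 1.0` is the standard definition, and only the weight 5 comes from the claim.

```python
def smooth_l1(x, beta=1.0):
    """Standard Smooth L1: 0.5 * x**2 / beta for |x| < beta, else |x| - 0.5 * beta."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta

LOSS_WEIGHT = 5.0  # weight from claim 8

def weighted_reg_loss(deltas):
    """Sum of Smooth L1 over box-regression residuals, scaled by the claim-8 weight."""
    return LOSS_WEIGHT * sum(smooth_l1(d) for d in deltas)
```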
CN202211288030.8A 2022-10-20 2022-10-20 Remote sensing image instance segmentation method by using data synthesis Pending CN115761223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211288030.8A CN115761223A (en) 2022-10-20 2022-10-20 Remote sensing image instance segmentation method by using data synthesis

Publications (1)

Publication Number Publication Date
CN115761223A true CN115761223A (en) 2023-03-07

Family

ID=85352375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211288030.8A Pending CN115761223A (en) 2022-10-20 2022-10-20 Remote sensing image instance segmentation method by using data synthesis

Country Status (1)

Country Link
CN (1) CN115761223A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994116A (en) * 2023-08-04 2023-11-03 北京泰策科技有限公司 Target detection method and system based on self-attention model and yolov5
CN116994116B (en) * 2023-08-04 2024-04-16 北京泰策科技有限公司 Target detection method and system based on self-attention model and yolov5

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN110119728B (en) Remote sensing image cloud detection method based on multi-scale fusion semantic segmentation network
CN108230329B (en) Semantic segmentation method based on multi-scale convolution neural network
CN113076842B (en) Method for improving traffic sign recognition accuracy in extreme weather and environment
CN103049763B (en) Context-constraint-based target identification method
US11854244B2 (en) Labeling techniques for a modified panoptic labeling neural network
CN110414387A (en) A kind of lane line multi-task learning detection method based on lane segmentation
CN109784283A (en) Based on the Remote Sensing Target extracting method under scene Recognition task
CN111368712A (en) Hyperspectral image disguised target detection method based on deep learning
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN111027511B (en) Remote sensing image ship detection method based on region of interest block extraction
CN113723377B (en) Traffic sign detection method based on LD-SSD network
CN112464911A (en) Improved YOLOv 3-tiny-based traffic sign detection and identification method
CN113762209A (en) Multi-scale parallel feature fusion road sign detection method based on YOLO
CN107506792B (en) Semi-supervised salient object detection method
CN114663346A (en) Strip steel surface defect detection method based on improved YOLOv5 network
CN112270331A (en) Improved billboard detection method based on YOLOV5
CN111462140B (en) Real-time image instance segmentation method based on block stitching
CN113850324B (en) Multispectral target detection method based on Yolov4
CN112257793A (en) Remote traffic sign detection method based on improved YOLO v3 algorithm
Xing et al. Traffic sign recognition using guided image filtering
CN116645592B (en) Crack detection method based on image processing and storage medium
CN110659601A (en) Depth full convolution network remote sensing image dense vehicle detection method based on central point
CN115131797A (en) Scene text detection method based on feature enhancement pyramid network
CN115861756A (en) Earth background small target identification method based on cascade combination network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination