CN114519819A - Remote sensing image target detection method based on global context awareness - Google Patents
Remote sensing image target detection method based on global context awareness
- Publication number: CN114519819A
- Application number: CN202210126106.0A
- Authority: CN (China)
- Prior art keywords: feature, target, features, feature map, candidate region
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
Abstract
The invention discloses a remote sensing image target detection method based on global context awareness. The method extracts image features with a deep residual network (ResNet-101), further processes the features and generates candidate regions with a Feature Pyramid Network (FPN); after the candidate regions are generated, the features are aligned by feature pooling; a global context extraction module is added at the highest layer of the feature extraction network, and the extracted global features are fused with the original features by addition to obtain new features; finally, the new features are classified by fully connected layers to produce target classes and bounding boxes. By exploiting the rich semantic information of high-level features, the method fully extracts the scene information of the image, strengthens the feature representation, and increases the recognition accuracy of dense targets, while also improving the recognition accuracy of other targets to some extent, thereby improving overall target detection performance on remote sensing images.
Description
Technical Field
The invention belongs to the technical field of pattern recognition, and particularly relates to a remote sensing image target detection method.
Background
Remote sensing image analysis has long been a research hotspot in computer vision and is widely applied in urban planning, land-use management, environmental monitoring and other fields. Target detection is a basic task of computer vision that supports downstream tasks such as event detection, target tracking, human-computer interaction and scene segmentation. Remote sensing images are usually captured from high altitude, and the shooting angle and height vary with the airborne or spaceborne sensor. Compared with natural images, remote sensing images contain richer scene information, more target types and denser target arrangements, so remote sensing image target detection faces great challenges. Although some algorithms have been proposed for remote sensing image target detection, their performance still leaves room for improvement, and the problem therefore remains one of the hot topics of current research.
The feature-enhanced SSD algorithm ("The feature-enhanced SSD algorithm and its application in remote sensing target detection", Acta Photonica Sinica, 2020, 49(01): 154-) improves the network's ability to extract small-target features by designing a shallow feature enhancement module, and designs a deep feature enhancement module to replace the deep layers of the SSD pyramid feature hierarchy. However, this method does not fully exploit the rich scene information in remote sensing images, so the improvement is limited.
The original FPN detection algorithm performs poorly on dense targets in remote sensing images because the feature pyramid network lacks sufficient scene information. Detecting dense objects relies on scene information: for example, cars appear only in parking lots or on roads, and cars generally appear near other cars. Without awareness of this context information, i.e. the global context, the network has difficulty identifying dense targets.
Disclosure of Invention
In order to overcome the shortcomings of the prior art, the invention provides a remote sensing image target detection method based on global context sensing. The method extracts image features with a deep residual network (ResNet-101), further processes the features and generates candidate regions with a Feature Pyramid Network (FPN); after the candidate regions are generated, the features are aligned by feature pooling; a global context extraction module is added at the highest layer of the feature extraction network, and the extracted global features are fused with the original features by addition to obtain new features; finally, the new features are classified by fully connected layers to produce target classes and bounding boxes. By exploiting the rich semantic information of high-level features, the method fully extracts the scene information of the image, strengthens the feature representation, and increases the recognition accuracy of dense targets, while also improving the recognition accuracy of other targets to some extent, thereby improving overall target detection performance on remote sensing images.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1: preprocessing and dividing the data set;
uniformly cutting the annotated images in the standard data set into a number of 1024×1024 images, retaining an overlap of 10% of the pixels in both width and height between adjacent crops, and then randomly dividing the images into a training set, a validation set and a test set with no intersection between them;
step 2: constructing a target detection deep neural network and training it with gradient descent and the back propagation algorithm; the network first extracts features with a ResNet-101 residual network, then generates candidate regions with a feature pyramid network (FPN), then performs global context perception on the candidate regions, and finally obtains target categories and bounding boxes through feature pooling and fully connected layers, specifically as follows:
step 2-1: initializing the ResNet-101 model parameters with a pre-trained model;
step 2-2: inputting the 1024×1024 images into the ResNet-101 residual network to extract features, generating six feature maps of different sizes, denoted C1-C6, with scales 512×512, 256×256, 128×128, 64×64, 32×32 and 16×16 respectively;
step 2-3: performing global max pooling on feature map C6 to obtain scene features containing scene information; applying a 10×10 convolution and a 1×1 convolution to the scene features to obtain the global features;
step 2-4: taking the feature map C5 as a feature map P5 of a feature pyramid;
upsampling feature map C5 and adding it to feature map C4 after a 1×1 convolution to generate feature map P4 of the feature pyramid;
upsampling feature map C4 and adding it to feature map C3 after a 1×1 convolution to generate feature map P3 of the feature pyramid;
upsampling feature map C3 and adding it to feature map C2 after a 1×1 convolution to generate feature map P2 of the feature pyramid;
step 2-5: feature maps P2, P3, P4, and P5 of the feature pyramid are 256 in size, respectively2、1282、642、322(ii) a Generating anchor points anchorages for each feature image in the feature pyramid by using an area generation network, wherein the aspect ratio corresponding to each anchorage comprises three types, namely 1:2, 1:1 and 2: 1; thus, the feature pyramid generates 15 different anchors;
and generating a target candidate region from each anchor according to formula (1):

x1 = xc − w/2, y1 = yc − h/2, x2 = xc + w/2, y2 = yc + h/2 (1)

where (xc, yc) are the coordinates of the anchor, (w, h) are the width and height of the target candidate region, and (x1, y1) and (x2, y2) are the coordinates of the upper-left and lower-right corners of the target candidate region;
calculating the intersection over union IoU between each target candidate region and the ground-truth labels: if IoU ≥ 0.7, the target candidate region is set as a positive sample; if IoU < 0.3, it is set as a negative sample; the resulting positive and negative samples serve as labels for training on the target candidate regions;
step 2-6: performing feature pooling on the target candidate regions, the feature layer k corresponding to a target candidate region being computed with equation (2):

k = ⌊k0 + log2(√(w·h) / 1024)⌋ (2)

where 1024 is the input image size and k0 is a reference value;
since the target candidate regions are generated by anchors from four different feature maps P2, P3, P4 and P5, feature pooling draws on 4 different feature layers;
the values of k are assigned to the 4 feature layers as follows:
after feature pooling, each target candidate region in feature maps P2, P3, P4 and P5 outputs a 7×7 result, i.e. 49 features are extracted;
step 2-7: adding the 49 features obtained in step 2-6 to the global features obtained in step 2-3, and feeding the result through two fully connected layers in sequence; the outputs of the two fully connected layers are the target category and the target bounding box;
step 3: inputting the remote sensing image to be detected into the trained target detection deep neural network, which outputs the category and bounding box of each target.
Preferably, k0 = 4.
The invention has the following beneficial effects:
the method fully extracts scene information of the image by utilizing the characteristic of rich high-level feature semantic information, further enhances feature representation, increases the recognition accuracy of dense targets, and also improves the recognition accuracy of other targets to a certain extent, thereby integrally improving the target detection performance in the remote sensing image.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a diagram of a target detection deep neural network according to the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The invention discloses a remote sensing image target detection method based on global context sensing, which improves the accuracy of target recognition in remote sensing images by extracting global context features to enhance the feature representation.
As shown in fig. 1, a remote sensing image target detection method based on global context awareness includes the following steps:
step 1: the DOTA dataset is processed. Because the size of the original data image of the DOTA data set is not fixed and the labeled data of the test set is not disclosed, 1869 images with labels are uniformly cut into 1024 x 1024 images for the convenience of neural network training, and the width and the height of each image are respectively reserved with the overlapping rate of 10% of pixels in order to prevent the target from being lost due to image cutting during cutting. The 19219 images and the labeling information thereof are obtained after processing, and are randomly divided into 11531 images of the training set, 3844 images of the verification set and 3844 images of the test set, so that no intersection exists among the training set, the verification set and the test set in the image sample space.
Step 2: as shown in fig. 2, construct a target detection deep neural network and train it with gradient descent and the back propagation algorithm; the network first extracts features with a ResNet-101 residual network, then generates candidate regions with a feature pyramid network (FPN), then performs global context perception on the candidate regions, and finally obtains target categories and bounding boxes through feature pooling and fully connected layers, specifically as follows:
step 2-1: because the neural network has many parameters and is difficult to train from scratch, the ResNet-101 model parameters are initialized with a pre-trained model before training;
step 2-2: train the neural network on the training set; inputting the 1024×1024 images into the ResNet-101 residual network to extract features generates six feature maps of different sizes, denoted C1-C6, with scales 512×512, 256×256, 128×128, 64×64, 32×32 and 16×16 respectively; C2, C3, C4 and C5 are selected to build the pyramid, since using C1 would occupy too much memory;
step 2-3: performing global max pooling on feature map C6 to obtain scene features containing scene information; applying a 10×10 convolution and a 1×1 convolution to the scene features to obtain the global features;
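A NumPy sketch of this global-context branch follows. Since a 10×10 convolution cannot be applied to the 1×1 output of a true global pooling, the sketch assumes the pooling produces a fixed 10×10 grid of maxima; that reading, the channel counts and the weight shapes are assumptions, not details stated in the patent:

```python
import numpy as np

def global_context(c6, w10, w1):
    """c6: (C, H, W) top-level feature map; w10: (C, C, 10, 10) weights of the
    10x10 convolution; w1: (C, C) weights of the 1x1 convolution."""
    c, h, w = c6.shape
    ys = np.linspace(0, h, 11, dtype=int)
    xs = np.linspace(0, w, 11, dtype=int)
    # max pooling onto a fixed 10x10 grid of scene features
    scene = np.stack([[[c6[k, ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
                        for j in range(10)] for i in range(10)]
                      for k in range(c)])                  # (C, 10, 10)
    # full-support 10x10 convolution: one value per output channel
    g = np.einsum('oikl,ikl->o', w10, scene)               # (C,)
    # a 1x1 convolution is a channel-mixing matrix at each (here single) position
    return w1 @ g                                          # global feature, (C,)
```

The global feature keeps one value per channel, which is later broadcast and added to every pooled candidate-region feature.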
step 2-4: taking the feature map C5 as a feature map P5 of a feature pyramid;
upsampling feature map C5 and adding it to feature map C4 after a 1×1 convolution to generate feature map P4 of the feature pyramid;
upsampling feature map C4 and adding it to feature map C3 after a 1×1 convolution to generate feature map P3 of the feature pyramid;
upsampling feature map C3 and adding it to feature map C2 after a 1×1 convolution to generate feature map P2 of the feature pyramid;
the 1 × 1 convolution is to ensure that the number of added feature map channels is the same;
step 2-5: the feature maps P2, P3, P4 and P5 of the feature pyramid have sizes 256², 128², 64² and 32² respectively; generating anchors for the feature maps of the feature pyramid with a region proposal network, each anchor taking one of three aspect ratios, 1:2, 1:1 and 2:1; the feature pyramid thus generates 15 different anchors;
and generating a target candidate region from each anchor according to formula (1):

x1 = xc − w/2, y1 = yc − h/2, x2 = xc + w/2, y2 = yc + h/2 (1)

where (xc, yc) are the coordinates of the anchor, (w, h) are the width and height of the target candidate region, and (x1, y1) and (x2, y2) are the coordinates of the upper-left and lower-right corners of the target candidate region;
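The corner computation from the anchor centre, expressed as code with the same symbols:

```python
def anchor_to_box(xc, yc, w, h):
    """Corners of the target candidate region from the anchor centre (xc, yc)
    and the region's width w and height h."""
    x1, y1 = xc - w / 2.0, yc - h / 2.0   # upper-left corner
    x2, y2 = xc + w / 2.0, yc + h / 2.0   # lower-right corner
    return x1, y1, x2, y2
```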
calculating the intersection over union IoU (Intersection over Union) between each target candidate region and the ground-truth labels: if IoU ≥ 0.7, the target candidate region is set as a positive sample; if IoU < 0.3, it is set as a negative sample; the resulting positive and negative samples serve as labels for training on the target candidate regions (one per anchor);
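The IoU computation and the labelling rule can be sketched as follows; boxes are (x1, y1, x2, y2) tuples, and treating samples between the two thresholds as ignored is an assumption in line with common practice, not a statement from the patent:

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def label_candidate(candidate, gt):
    v = iou(candidate, gt)
    if v >= 0.7:
        return 1        # positive sample
    if v < 0.3:
        return 0        # negative sample
    return None         # in between: assumed ignored during training
```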
step 2-6: performing feature pooling on the target candidate regions, the feature layer k corresponding to a target candidate region being computed with equation (2):

k = ⌊k0 + log2(√(w·h) / 1024)⌋ (2)

where 1024 is the input image size and k0 is a reference value, generally taken as 4;
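Equation (2) as code; the floor and the clamp to the available levels P2-P5 stand in for the value table that follows the equation in the original, and are assumptions consistent with standard FPN practice:

```python
import math

def roi_level(w, h, k0=4):
    """Pyramid level for a candidate region of width w and height h:
    k = floor(k0 + log2(sqrt(w * h) / 1024)), clamped to P2..P5."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 1024))
    return min(max(k, 2), 5)
```

With k0 = 4 and a 1024-pixel reference, a full-image region maps to P4, and small regions fall to P2, so large candidates are pooled from coarser pyramid levels.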
since the target candidate regions are generated by anchors from four different feature maps P2, P3, P4 and P5, feature pooling draws on 4 different feature layers;
the values of k are assigned to the 4 feature layers as follows:
after feature pooling, each target candidate region in feature maps P2, P3, P4 and P5 outputs a 7×7 result, i.e. 49 features are extracted;
step 2-7: adding the 49 features obtained in step 2-6 to the global features obtained in step 2-3, and feeding the result through two fully connected layers in sequence; the outputs of the two fully connected layers are the target category and the target bounding box;
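Step 2-7 can be sketched as follows. The channel count, hidden width, class count and random weights are illustrative placeholders, and splitting the second stage into parallel category and box outputs is an assumption about how the two results are produced:

```python
import numpy as np

rng = np.random.default_rng(0)
C, hidden_dim, num_classes = 32, 128, 16       # illustrative sizes, not from the patent

pooled = rng.standard_normal((C, 7, 7))        # the 7x7 (49) pooled features
g = rng.standard_normal((C, 1, 1))             # global feature, broadcast over the 7x7 grid

fused = (pooled + g).reshape(-1)               # element-wise addition, then flatten
W1 = rng.standard_normal((hidden_dim, fused.size))
W_cls = rng.standard_normal((num_classes, hidden_dim))
W_box = rng.standard_normal((4, hidden_dim))

hidden = np.maximum(W1 @ fused, 0.0)           # first fully connected layer + ReLU
class_scores = W_cls @ hidden                  # target-category logits
box = W_box @ hidden                           # bounding-box output
```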
Step 3: inputting the remote sensing image to be detected into the trained target detection deep neural network, which outputs the category and bounding box of each target.
Claims (2)
1. A remote sensing image target detection method based on global context sensing is characterized by comprising the following steps:
step 1: preprocessing and dividing the data set;
uniformly cutting the annotated images in the standard data set into a number of 1024×1024 images, retaining an overlap of 10% of the pixels in both width and height between adjacent crops, and then randomly dividing the images into a training set, a validation set and a test set with no intersection between them;
step 2: constructing a target detection deep neural network and training it with gradient descent and the back propagation algorithm; the network first extracts features with a ResNet-101 residual network, then generates candidate regions with a feature pyramid network (FPN), then performs global context perception on the candidate regions, and finally obtains target categories and bounding boxes through feature pooling and fully connected layers, specifically as follows:
step 2-1: initializing the ResNet-101 model parameters with a pre-trained model;
step 2-2: inputting the 1024×1024 images into the ResNet-101 residual network to extract features, generating six feature maps of different sizes, denoted C1-C6, with scales 512×512, 256×256, 128×128, 64×64, 32×32 and 16×16 respectively;
step 2-3: performing global max pooling on feature map C6 to obtain scene features containing scene information; applying a 10×10 convolution and a 1×1 convolution to the scene features to obtain the global features;
step 2-4: taking the feature map C5 as a feature map P5 of a feature pyramid;
upsampling feature map C5 and adding it to feature map C4 after a 1×1 convolution to generate feature map P4 of the feature pyramid;
upsampling feature map C4 and adding it to feature map C3 after a 1×1 convolution to generate feature map P3 of the feature pyramid;
upsampling feature map C3 and adding it to feature map C2 after a 1×1 convolution to generate feature map P2 of the feature pyramid;
step 2-5: the feature maps P2, P3, P4 and P5 of the feature pyramid have sizes 256², 128², 64² and 32² respectively; generating anchors for the feature maps of the feature pyramid with a region proposal network, each anchor taking one of three aspect ratios, 1:2, 1:1 and 2:1; the feature pyramid thus generates 15 different anchors;
and generating a target candidate region from each anchor according to formula (1):

x1 = xc − w/2, y1 = yc − h/2, x2 = xc + w/2, y2 = yc + h/2 (1)

where (xc, yc) are the coordinates of the anchor, (w, h) are the width and height of the target candidate region, and (x1, y1) and (x2, y2) are the coordinates of the upper-left and lower-right corners of the target candidate region;
calculating the intersection over union IoU between each target candidate region and the ground-truth labels: if IoU ≥ 0.7, the target candidate region is set as a positive sample; if IoU < 0.3, it is set as a negative sample; the resulting positive and negative samples serve as labels for training on the target candidate regions;
step 2-6: performing feature pooling on the target candidate regions, the feature layer k corresponding to a target candidate region being computed with equation (2):

k = ⌊k0 + log2(√(w·h) / 1024)⌋ (2)

where 1024 is the input image size and k0 is a reference value;
since the target candidate regions are generated by anchors from four different feature maps P2, P3, P4 and P5, feature pooling draws on 4 different feature layers;
the values of k are assigned to the 4 feature layers as follows:
after feature pooling, each target candidate region in feature maps P2, P3, P4 and P5 outputs a 7×7 result, i.e. 49 features are extracted;
step 2-7: adding the 49 features obtained in step 2-6 to the global features obtained in step 2-3, and feeding the result through two fully connected layers in sequence; the outputs of the two fully connected layers are the target category and the target bounding box;
step 3: inputting the remote sensing image to be detected into the trained target detection deep neural network, which outputs the category and bounding box of each target.
2. The remote sensing image target detection method based on global context awareness according to claim 1, wherein k0 = 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210126106.0A CN114519819B (en) | 2022-02-10 | 2022-02-10 | Remote sensing image target detection method based on global context awareness |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114519819A true CN114519819A (en) | 2022-05-20 |
CN114519819B CN114519819B (en) | 2024-04-02 |
Family
ID=81596492
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210126106.0A Active CN114519819B (en) | 2022-02-10 | 2022-02-10 | Remote sensing image target detection method based on global context awareness |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114519819B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018214195A1 (en) * | 2017-05-25 | 2018-11-29 | 中国矿业大学 | Remote sensing imaging bridge detection method based on convolutional neural network |
CN111368775A (en) * | 2020-03-13 | 2020-07-03 | 西北工业大学 | Complex scene dense target detection method based on local context sensing |
WO2020181685A1 (en) * | 2019-03-12 | 2020-09-17 | 南京邮电大学 | Vehicle-mounted video target detection method based on deep learning |
CN112070729A (en) * | 2020-08-26 | 2020-12-11 | 西安交通大学 | Anchor-free remote sensing image target detection method and system based on scene enhancement |
CN112766409A (en) * | 2021-02-01 | 2021-05-07 | 西北工业大学 | Feature fusion method for remote sensing image target detection |
CN113111740A (en) * | 2021-03-27 | 2021-07-13 | 西北工业大学 | Characteristic weaving method for remote sensing image target detection |
Non-Patent Citations (1)
Title |
---|
乔文凡; 慎利; 戴延帅; 曹云刚: "Automatic building recognition from high-resolution imagery combining dilated convolution residual networks and pyramid pooling representation", Geography and Geo-Information Science, no. 05, 27 August 2018 (2018-08-27) *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114926747A (en) * | 2022-05-31 | 2022-08-19 | 常州大学 | Remote sensing image directional target detection method based on multi-feature aggregation and interaction |
CN115100428A (en) * | 2022-07-01 | 2022-09-23 | 天津大学 | Target detection method using context sensing |
CN115937672A (en) * | 2022-11-22 | 2023-04-07 | 南京林业大学 | Remote sensing rotating target detection method based on deep neural network |
CN116486077A (en) * | 2023-04-04 | 2023-07-25 | 中国科学院地理科学与资源研究所 | Remote sensing image semantic segmentation model sample set generation method and device |
CN116486077B (en) * | 2023-04-04 | 2024-04-30 | 中国科学院地理科学与资源研究所 | Remote sensing image semantic segmentation model sample set generation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN114519819B (en) | 2024-04-02 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |