CN111126202B - Optical remote sensing image target detection method based on void feature pyramid network - Google Patents

Optical remote sensing image target detection method based on void feature pyramid network

Info

Publication number
CN111126202B
CN111126202B (application CN201911271302.1A)
Authority
CN
China
Prior art keywords
network
remote sensing
feature
training
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911271302.1A
Other languages
Chinese (zh)
Other versions
CN111126202A (en)
Inventor
应翔
申继宁
高洁
刘志强
于健
李雪威
喻梅
于瑞国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201911271302.1A
Publication of CN111126202A
Application granted
Publication of CN111126202B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Abstract

The invention relates to an optical remote sensing image target detection method based on an atrous (void) feature pyramid network, comprising the following steps: S1, dividing the adopted optical remote sensing image data set into a training set and a test set; S2, performing size transformation, standardization, and normalization on the optical remote sensing images in the data set, and applying data augmentation to the training set; S3, constructing an atrous feature pyramid network with atrous convolution, and training the network model with the images in the training set; S4, detecting remote sensing images with the trained target detection model, and analyzing and comparing the detection results. The method is scientifically and reasonably designed; by constructing a novel atrous feature fusion module, it improves the detection performance for multi-scale targets in optical remote sensing images and obtains better generalization ability.

Description

Optical remote sensing image target detection method based on void feature pyramid network
Technical Field
The invention belongs to the field of computer vision, relates to multi-scale target detection in computer vision tasks, and in particular relates to an optical remote sensing image target detection method based on an atrous feature pyramid network (the "void"/"hole" feature pyramid network of the original translation).
Background
Object detection is one of the basic problems in computer vision recognition tasks and has wide application in many fields. Target detection in optical remote sensing images has broad application prospects in military applications, urban planning, environmental management, and other areas. Unlike target detection on natural images, targets in optical remote sensing images are much smaller than those in natural images, and their sizes and orientations are diverse (e.g., playgrounds, cars, bridges). Furthermore, the visual appearance of target instances in remote sensing images varies greatly due to occlusion, shadow, illumination, resolution, and viewpoint changes. Therefore, detecting objects in remote sensing images is much more difficult than in natural images. At present, target detection algorithms for optical remote sensing images fall mainly into two categories: methods based on traditional image processing and machine learning algorithms, and methods based on deep learning.
Among them, the image features used by target detection methods based on traditional image processing and machine learning algorithms are designed manually, such as SIFT (Scale-Invariant Feature Transform), HOG (Histogram of Oriented Gradients), and SURF (Speeded-Up Robust Features). These methods extract features from the input image with a manually designed feature extractor, identify the target according to the features, and localize the target with a corresponding strategy. For a long time, object detection algorithms based on manually designed features dominated the computer vision field. However, such manually designed features are not very robust to the diversity of targets.
Target detection algorithms based on deep learning use deep convolutional neural networks to learn feature representations from data automatically, and can learn representations with good robustness and strong expressive power. With the rapid development of deep learning, target detection has shifted from traditional algorithms based on manual features to algorithms based on deep convolutional neural networks, and the object detection task has advanced greatly in both speed and accuracy over the past few years. At present, target detection algorithms based on deep convolutional neural networks fall mainly into two categories: two-stage methods and one-stage methods. Two-stage methods first extract candidate regions from a given image, then classify and regress each extracted candidate region. One-stage methods use a single convolutional neural network that reformulates object detection as a regression problem to predict the class and location of targets directly. In general, two-stage methods have an advantage in accuracy, while one-stage methods have an advantage in speed; however, as target detection algorithms continue to evolve, both types increasingly balance speed and accuracy.
At present, multi-scale target detection remains a challenging problem. The object detection task is an extension of the classification task. Deep features in a deep convolutional neural network contain rich semantic information, which benefits image classification but lacks the detailed information that benefits small-target detection. Shallow features in the network have higher spatial resolution, which benefits bounding-box regression but lacks the high-level semantic information that benefits target classification. Therefore, to handle target scale variation in detection tasks, many convolutional-neural-network-based detection algorithms gradually blend the semantic information of deep features into shallow features, and various feature pyramid structures have been proposed. The image pyramid resizes the input image to several scales and feeds the images of different scales to the feature extraction network to generate feature maps of different scales; this approach significantly increases memory and computational cost and is inefficient. Methods such as Faster R-CNN [1], YOLOv1 [2], and R-FCN [3] detect targets using the feature map output by the last convolutional layer of the feature extraction network; because they predict from only a single-scale feature map, their detection performance for multi-scale targets, especially small targets, is poor. The SSD [4] algorithm builds a feature pyramid architecture by extracting multi-level features from the backbone network. Although it selects features of multiple levels and scales in the network for prediction, its detection of small targets, which depends on deep semantic information, is poor because contextual semantic information is not fused.
The FPN [5] target detection algorithm adopts a top-down path and lateral connections, fuses the semantic information of shallow and deep features, and detects targets of different sizes with features of different levels. However, the semantic information of the shallow features in this structure still cannot meet the detection requirements of multi-scale targets.
Although the above-mentioned target detection algorithms have achieved good results on natural images, their detection accuracy on optical remote sensing images still needs improvement; in particular, their performance on multi-scale targets in optical remote sensing images is not ideal.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an optical remote sensing image target detection method based on an atrous feature pyramid network, which improves the detection performance for multi-scale targets in optical remote sensing images by constructing a novel atrous feature fusion module and obtains better generalization ability.
The technical problem to be solved by the invention is realized by the following technical scheme:
A method for detecting targets in optical remote sensing images based on an atrous feature pyramid network, comprising the following steps:
s1: dividing the adopted image data set into a training set (80%, used for network model training) and a test set (20%, used for model testing), keeping the data distribution of the different sample classes as consistent as possible between the training and test sets;
s2: performing size transformation, standardization, and normalization on the optical remote sensing images in the data set, and applying data augmentation to the training set;
s201, preprocessing the images in the data set on the basis of S1: the adopted data set is resized, setting the shortest and longest edges of the input image to 600 and 1000 pixels, respectively;
s202, calculating the RGB mean of the selected data set over the divided training set, and subtracting the RGB mean from all samples in the training and test sets to highlight the feature differences among individuals in the images;
s203, standardizing and normalizing the images in the data set: following convex optimization theory and knowledge of data probability distributions, the data are centered by a mean-removal operation to standardize the images; normalization is realized by mapping each pixel value in the image to the range 0-1;
s204, augmenting the data with simple horizontal flipping and random cropping operations, thereby increasing the number of training samples in the training set and improving the robustness of the target detection model;
s3: constructing an atrous feature pyramid network using atrous convolution, and training the network model with the images in the training set;
s301, constructing the atrous feature pyramid network: ResNet-101 is selected as the backbone of the target detection network; ResNet-101 extracts feature maps of different scales with residual blocks, and the outputs of the last residual structure of its last four convolution blocks are extracted as basic features, the basic feature maps being denoted {C2, C3, C4, C5};
s302, in the atrous feature fusion module (AFFM) of the pyramid network, C2 has its number of feature channels reduced to 256 by Conv1×1; {C3, C4, C5} are each upsampled via bilinear interpolation to the size of the C2 feature map, and the number of channels of each upsampled feature map is reduced to 256 by a Conv1×1 operation; the feature maps obtained above are then concatenated by a Concat operation to obtain a multi-level fusion feature, and Conv1×1 is applied to reduce the feature dimension of the multi-level fusion feature to 256;
s303, constructing the atrous lateral connection module: three convolution operations, Conv1×1, Conv3×3, and Conv5×5, give the module's three branches receptive fields of different sizes; Conv3×3 operations with different atrous rates are added after the three branches, and the feature maps generated by each branch are concatenated by a Concat operation, yielding a lateral connection feature map with stronger multi-scale expressive capability;
s304, several groups of feature maps of different scales are generated from bottom to top through multi-level downsampling and the atrous lateral connection module, denoted {P2, P3, P4, P5} and corresponding to {C2, C3, C4, C5}, respectively; the multi-level fusion feature and the features generated by the atrous lateral connection module are integrated by a channel concatenation operation, and {P2, P3, P4, P5} are obtained by a Conv1×1 operation; these feature maps are computed as:
P2 = Conv1×1(Concat(ALCB(C2), M)), where M is the multi-level fusion feature produced by the AFFM;
Pi = Conv1×1(Concat(ALCB(Ci), Conv3×3(Pi−1))), i = 3, 4, 5
wherein: piMulti-level features for input to a detection network header to predict results;
ALCB (Ci) is a multi-branch convolution operation function with convolution kernels of different sizes and hole rate;
Conv3×3(Pi-1) For convolution operations with a convolution kernel size of 3 x 3 and a step size of 2, i.e. for Pi-1Carrying out down-sampling operation;
s305, the {P2, P3, P4, P5} generated by the atrous feature fusion module are input into the region proposal network and the detection network head behind the network model to generate candidate regions and compute the detection results;
s306, training the constructed atrous feature pyramid network with the obtained training set, adopting an approximate joint training strategy: the network model is trained for 100K iterations in total, with a learning rate of 10^-3 for the first 60K iterations and 10^-4 for the next 20K iterations; the weight decay and momentum are 0.00004 and 0.9, respectively;
s4: detecting remote sensing images with the trained target detection model, and analyzing and comparing the detection results; the obtained detections are de-duplicated by a non-maximum suppression operation with its IoU threshold set to 0.7, and mAP is selected as the evaluation metric for measuring remote sensing image target detection, with its IoU threshold set to 0.5.
The invention has the advantages and beneficial effects that:
1. The optical remote sensing image target detection method based on the atrous feature pyramid network, aimed at the problem of multi-scale target detection in remote sensing images, uses the atrous feature fusion module to construct an atrous feature pyramid network, can significantly improve the detection performance of the Faster R-CNN target detection algorithm, and realizes accurate recognition and detection of multi-scale targets in optical remote sensing images.
2. The optical remote sensing image target detection method based on the atrous feature pyramid network can realize 96.70% mAP on the NWPU VHR-10 optical remote sensing image data set on a single Tesla K80 GPU, and can reach 96.75% mAP on the RSOD data set.
3. The optical remote sensing image target detection method based on the atrous feature pyramid network is also superior to traditional two-stage methods such as Faster R-CNN and FPN on the PASCAL VOC natural image data set, reaching 81.7% mAP.
4. The optical remote sensing image target detection method based on the atrous feature pyramid network can improve the detection performance for multi-scale targets and targets of complex appearance in remote sensing image data sets; the method has good generalization performance and good robustness to targets with scale changes.
Drawings
FIG. 1 is a flow chart of the detection method of the present invention;
FIG. 2 is a schematic diagram of the atrous feature fusion module (AFFM) of the present invention;
FIG. 3 is a schematic diagram of the atrous lateral connection module (ALCB) of the present invention;
FIG. 4 is a diagram illustrating the detection results of the atrous feature pyramid network of the present invention.
Detailed Description
The present invention is further illustrated by the following specific examples, which are intended to be illustrative, not limiting and are not intended to limit the scope of the invention.
A method for detecting targets in optical remote sensing images based on an atrous feature pyramid network, comprising the following steps:
s1: dividing the adopted image data set into a training set (80%, used for network model training) and a test set (20%, used for model testing), keeping the data distribution of the different sample classes as consistent as possible between the training and test sets;
s2: performing size transformation, standardization, and normalization on the optical remote sensing images in the data set, and applying data augmentation to the training set;
s201, preprocessing the images in the data set on the basis of S1: the adopted data set is resized, setting the shortest and longest edges of the input image to 600 and 1000 pixels, respectively;
s202, calculating the RGB mean of the selected data set over the divided training set, and subtracting the RGB mean from all samples in the training and test sets to highlight the feature differences among individuals in the images;
s203, standardizing and normalizing the images in the data set: following convex optimization theory and knowledge of data probability distributions, the data are centered by a mean-removal operation to standardize the images; normalization is realized by mapping each pixel value in the image to the range 0-1;
s204, augmenting the data with simple horizontal flipping and random cropping operations, thereby increasing the number of training samples in the training set and improving the robustness of the target detection model;
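The resize rule of S201 (shortest edge scaled to 600 pixels, longest edge capped at 1000) can be sketched as a small helper; the function name and the rounding behavior are illustrative assumptions, not taken from the patent:

```python
def resize_shape(h, w, shortest=600, longest=1000):
    """Scale (h, w) so the shortest edge becomes `shortest`,
    but never let the longest edge exceed `longest`."""
    scale = shortest / min(h, w)
    if scale * max(h, w) > longest:
        scale = longest / max(h, w)  # cap: longest edge wins
    return round(h * scale), round(w * scale)
```

For example, a 600x1500 image cannot reach a 600-pixel shortest edge without its longest edge exceeding 1000, so the cap applies and it is scaled to 400x1000.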
s3: constructing an atrous feature pyramid network using atrous convolution, and training the network model with the images in the training set;
s301, constructing the atrous feature pyramid network: ResNet-101 is selected as the backbone of the target detection network; ResNet-101 extracts feature maps of different scales with residual blocks, and the outputs of the last residual structure of its last four convolution blocks are extracted as basic features, the basic feature maps being denoted {C2, C3, C4, C5};
s302, in the atrous feature fusion module AFFM (Atrous Feature Fusion Module) of the pyramid network, C2 has its number of feature channels reduced to 256 by Conv1×1; {C3, C4, C5} are each upsampled via bilinear interpolation to the size of the C2 feature map, and the number of channels of each upsampled feature map is reduced to 256 by a Conv1×1 operation; the feature maps obtained above are then concatenated by a Concat operation to obtain a multi-level fusion feature, and Conv1×1 is applied to reduce the feature dimension of the multi-level fusion feature to 256;
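A minimal PyTorch sketch of the AFFM fusion in S302. The 256-channel reduction, bilinear upsampling to C2's size, and Concat-then-Conv1×1 follow the text above; the input channel tuple (ResNet-101's typical C2-C5 widths) and the module name are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFFM(nn.Module):
    """Atrous Feature Fusion Module (sketch): upsample C3-C5 to C2's
    spatial size, reduce each map to 256 channels with Conv1x1, concat
    them, then reduce the concatenation back to 256 channels."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.reduce = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.fuse = nn.Conv2d(out_ch * len(in_channels), out_ch, 1)

    def forward(self, feats):          # feats = [C2, C3, C4, C5]
        size = feats[0].shape[-2:]     # spatial size of C2
        maps = []
        for f, conv in zip(feats, self.reduce):
            if f.shape[-2:] != size:   # C3-C5: bilinear upsample to C2's size
                f = F.interpolate(f, size=size, mode="bilinear", align_corners=False)
            maps.append(conv(f))       # Conv1x1 channel reduction to 256
        return self.fuse(torch.cat(maps, dim=1))
```

With C2 at 64x64 and C5 at 8x8, the output is a single 256-channel map at 64x64.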
s303, constructing the atrous lateral connection module ALCB (Atrous Lateral Connection Block): three convolution operations, Conv1×1, Conv3×3, and Conv5×5, give the module's three branches receptive fields of different sizes; Conv3×3 operations with different atrous rates are added after the three branches, and the feature maps generated by each branch are concatenated by a Concat operation, yielding a lateral connection feature map with strong multi-scale expressive capability;
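The three-branch structure of S303 can be sketched as follows. The Conv1×1/Conv3×3/Conv5×5 branches each followed by a 3×3 atrous convolution and the final Concat come from the text; the specific dilation rates (1, 2, 4), the 256-channel branch width, and the omission of the global-average-pooling branch mentioned in the claims are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ALCB(nn.Module):
    """Atrous Lateral Connection Block (sketch): Conv1x1 / Conv3x3 /
    Conv5x5 branches, each followed by a 3x3 atrous conv with a
    different dilation rate; branch outputs are concatenated."""
    def __init__(self, in_ch, branch_ch=256, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                # padding = k // 2 keeps the spatial size unchanged
                nn.Conv2d(in_ch, branch_ch, k, padding=k // 2),
                # atrous 3x3: padding = dilation keeps the size unchanged
                nn.Conv2d(branch_ch, branch_ch, 3, padding=d, dilation=d),
            )
            for k, d in zip((1, 3, 5), dilations)
        )

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)
```

Three 256-channel branches yield a 768-channel lateral map, later reduced by the Conv1×1 of S304.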
s304, several groups of feature maps of different scales are generated from bottom to top through multi-level downsampling and the atrous lateral connection module, denoted {P2, P3, P4, P5} and corresponding to {C2, C3, C4, C5}, respectively; the multi-level fusion feature and the features generated by the atrous lateral connection module are integrated by a channel concatenation operation, and {P2, P3, P4, P5} are obtained by a Conv1×1 operation; these feature maps are computed as:
P2 = Conv1×1(Concat(ALCB(C2), M)), where M is the multi-level fusion feature produced by the AFFM;
Pi = Conv1×1(Concat(ALCB(Ci), Conv3×3(Pi−1))), i = 3, 4, 5
wherein: piMulti-level features for input to a detection network header to predict results;
ALCB (Ci) is a multi-branch convolution operation function with convolution kernels of different sizes and hole rate;
Conv3×3(Pi-1) For convolution operations with a convolution kernel size of 3 x 3 and a step size of 2, i.e. for Pi-1Carrying out down-sampling operation;
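One level of the bottom-up recursion above can be checked shape-wise with stand-in layers; all 256-channel widths are assumptions, and ALCB is replaced here by a plain Conv1×1 for brevity:

```python
import torch
import torch.nn as nn

# Stand-ins for one step Pi = Conv1x1(Concat(ALCB(Ci), Conv3x3(Pi-1)))
lateral = nn.Conv2d(256, 256, 1)                    # stand-in for ALCB(Ci)
down = nn.Conv2d(256, 256, 3, stride=2, padding=1)  # Conv3x3, stride 2
fuse = nn.Conv2d(512, 256, 1)                       # Conv1x1 after Concat

p_prev = torch.randn(1, 256, 64, 64)                # Pi-1
c_i = torch.randn(1, 256, 32, 32)                   # Ci (already 256-d)
p_i = fuse(torch.cat([lateral(c_i), down(p_prev)], dim=1))
```

The stride-2 Conv3×3 halves the spatial size of Pi−1 so it aligns with Ci before concatenation.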
s305, the {P2, P3, P4, P5} generated by the atrous feature fusion module are input into the region proposal network and the detection network head behind the network model to generate candidate regions and compute the detection results;
s306, training the constructed atrous feature pyramid network with the obtained training set, adopting an approximate joint training strategy: the network model is trained for 100K iterations in total, with a learning rate of 10^-3 for the first 60K iterations and 10^-4 for the next 20K iterations; the weight decay and momentum are 0.00004 and 0.9, respectively;
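The piecewise schedule of S306 can be written as a small helper. Note the patent only states the rates for the first 80K iterations; the value used for the final 20K (`final_lr`) is an assumed placeholder:

```python
def learning_rate(iteration, final_lr=1e-5):
    """Learning rate per the stated schedule: 1e-3 for the first 60K
    iterations, 1e-4 for the next 20K. The rate for the last 20K is
    NOT specified in the patent; `final_lr` is an assumption."""
    if iteration < 60_000:
        return 1e-3
    if iteration < 80_000:
        return 1e-4
    return final_lr
```

The stated momentum (0.9) and weight decay (0.00004) would be passed to the SGD optimizer alongside this schedule.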
s4: detecting remote sensing images with the trained target detection model, and analyzing and comparing the detection results; the obtained detections are de-duplicated by a non-maximum suppression operation with its IoU threshold set to 0.7, and mAP is selected as the evaluation metric for measuring remote sensing image target detection, with its IoU threshold set to 0.5.
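The de-duplication step of S4 is standard greedy non-maximum suppression; a minimal pure-Python sketch with the stated IoU threshold of 0.7 (the (x1, y1, x2, y2) box format is an assumption):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap it by more than iou_thresh, repeat. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [j for j in order if iou(boxes[best], boxes[j]) < iou_thresh]
    return keep
```

mAP evaluation at IoU 0.5 would then count a detection as correct when it overlaps a ground-truth box of the same class by at least 0.5.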
The detection results of the method provided by the invention are compared with those of existing methods to further analyze the strengths and weaknesses of the model.
The AP for each single-class target and the mAP over all target classes for each method are shown in Table 1.
TABLE 1 comparison of test results on NWPU VHR-10 dataset
As can be seen from Table 1, the method of the present invention achieves 96.89% mAP on the NWPU VHR-10 dataset, an improvement of about 10% over the original Faster R-CNN. In addition, the detection result of the atrous feature pyramid network is also superior to FPN, by about 3.8%, and on the NWPU VHR-10 dataset it further improves by 14.5% and 5.6%, respectively, over the one-stage SSD detectors.
Table 2 compares the detection results on the RSOD dataset, and Table 3 compares the detection results on the PASCAL VOC dataset. As can be seen from Tables 2 and 3, the detection results of the method of the present invention are also superior to those of the other methods on the RSOD and PASCAL VOC datasets.
Table 2 comparison of test results on RSOD dataset
TABLE 3 comparison of test results on PASCAL VOC data set
The detection results of the atrous feature pyramid network are shown in FIG. 4; the method provided by the invention improves the detection accuracy for targets of multiple sizes and complex appearance in the data set, such as oil storage tanks, bridges, and playgrounds.
The invention, a method for detecting targets in optical remote sensing images based on an atrous feature pyramid network, addresses the scale-variation problem of target detection in optical remote sensing images and improves the detection precision for multi-scale targets; a feature pyramid network with multi-scale feature expressive capability is constructed through the atrous feature fusion module AFFM, and within the AFFM a multi-branch atrous convolution module is used to make full use of image feature information.
The atrous feature pyramid network not only achieves 96.89% mAP on the NWPU VHR-10 dataset but also achieves good detection performance on other optical remote sensing image datasets and on natural image datasets. The experimental results show that the method not only improves detection performance on remote sensing images but also handles natural images well, demonstrating good generalization.
Although the embodiments of the present invention and the accompanying drawings are disclosed for illustrative purposes, those skilled in the art will appreciate that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims, and therefore the scope of the invention is not limited to the disclosure of the embodiments and the accompanying drawings.

Claims (1)

1. A method for detecting targets in optical remote sensing images based on an atrous feature pyramid network, characterized by comprising the following steps:
s1: dividing the adopted image data set into a training set (80%, used for network model training) and a test set (20%, used for model testing), keeping the data distribution of the different sample classes as consistent as possible between the training and test sets;
s2: performing size transformation, standardization, and normalization on the optical remote sensing images in the data set, and applying data augmentation to the training set;
s201, preprocessing the images in the data set on the basis of S1: the adopted data set is resized, setting the shortest and longest edges of the input image to 600 and 1000 pixels, respectively;
s202, calculating the RGB mean of the selected data set over the divided training set, and subtracting the RGB mean from all samples in the training and test sets to highlight the feature differences among individuals in the images;
s203, standardizing and normalizing the images in the data set: following convex optimization theory and knowledge of data probability distributions, the data are centered by a mean-removal operation to standardize the images; normalization is realized by mapping each pixel value in the image to the range 0-1;
s204, augmenting the data with simple horizontal flipping and random cropping operations, thereby increasing the number of training samples in the training set and improving the robustness of the target detection model;
s3: constructing a hole characteristic pyramid network by using hole convolution, and training a network model by using images in a training set;
s301, constructing a hole characteristic pyramid network, selecting ResNet-101 as a basic network of a target detection network, extracting characteristic graphs with different scales by using residual blocks by the ResNet-101, extracting the output of the last residual structure of the following four rolling blocks from the ResNet-101 basic network as basic characteristics, and representing the basic characteristic graphs as { C }2,C3,C4,C5};
S302, in a hole feature fusion module AFFM of the pyramid network, C2Reducing the number of feature channels to 256 dimensions by Conv1 × 1, { C3,C4,C5Respectively interpolating each feature map bilinearly to C by an upsampling operation2Reducing the channel number of the feature map after upsampling to 256 dimensions by Conv1 multiplied by 1 operation, then connecting the obtained feature maps in series by Concat operation to obtain multi-level fusion features, and then reducing the feature dimensions of the multi-level fusion features to 256 dimensions by applying Conv1 multiplied by 1;
s303, constructing a cavity transverse connection module, enabling three branches of the cavity transverse connection module to have different-size receptive fields by adopting three convolution operations of Conv1 multiplied by 1, Conv3 multiplied by 3 and Conv5 multiplied by 5, adding Conv3 multiplied by 3 operations with different cavity rates behind the three branches, and splicing a feature map generated by each branch and a Global Average Pooling branch together through a Concat operation so as to obtain a transverse connection feature map with stronger multi-scale expression capability;
S304, the fused feature maps pass bottom-up through multiple layers of downsampling and atrous lateral connection blocks to generate several groups of feature maps of different scales, denoted {P2, P3, P4, P5} and corresponding to {C2, C3, C4, C5} respectively; the features derived from the multi-level fused feature and those produced by the atrous lateral connection block are integrated by a channel concatenation operation, and {P2, P3, P4, P5} are obtained by a Conv1×1 operation. These feature maps are computed as:
Pi = Conv1×1(Concat(ALCB(Ci), Conv3×3(Pi-1)))
wherein: piMulti-level features for input to a detection network header to predict results;
ALCB(Ci) Performing a multi-branch convolution operation function with convolution kernels of different sizes and a void rate;
Conv3×3(Pi-1) For convolution operations with a convolution kernel size of 3 x 3 and a step size of 2, i.e. for Pi-1Carrying out down-sampling operation;
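The recursive construction defined by the three terms above can be sketched with stand-in operations (2×2 average pooling replaces the stride-2 Conv3×3, and a random 1×1 projection replaces the full multi-branch ALCB; both substitutions are for brevity only):

```python
import numpy as np

def conv1x1(x, out_ch, rng):
    """Random-weight 1x1 convolution: (Cin,H,W) -> (out_ch,H,W)."""
    w = rng.random((out_ch, x.shape[0])) * 0.01
    return np.einsum('oc,chw->ohw', w, x)

def downsample2(x):
    """Stand-in for the stride-2 Conv3x3: 2x2 average pooling halves H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def alcb(x, out_ch, rng):
    """Placeholder for the multi-branch dilated block: here just a 1x1 projection."""
    return conv1x1(x, out_ch, rng)

rng = np.random.default_rng(0)
# Toy {C2..C5} at strides 4/8/16/32, already reduced to 256 channels.
C = {i: rng.random((256, 64 // 2 ** (i - 2), 64 // 2 ** (i - 2))) for i in range(2, 6)}

P = {2: conv1x1(alcb(C[2], 256, rng), 256, rng)}       # base level from the fused features
for i in range(3, 6):
    # Pi = Conv1x1(Concat(ALCB(Ci), Conv3x3_stride2(P_{i-1})))
    merged = np.concatenate([alcb(C[i], 256, rng), downsample2(P[i - 1])], axis=0)
    P[i] = conv1x1(merged, 256, rng)
```

Each Pi ends up at the same resolution as its Ci, so every pyramid level carries both the multi-level fused semantics and the scale-specific lateral features.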
S305, the {P2, P3, P4, P5} generated by the atrous feature fusion module are fed into the region proposal network and the detection network head that follow in the network model, which then generate the candidate regions and compute the detection results;
S306, training the constructed atrous feature pyramid network on the obtained training set with an approximate joint training strategy: the network model is trained for 100K iterations in total, with a learning rate of 10^-3 for the first 60K iterations and 10^-4 for the next 20K iterations; the weight decay and momentum are 0.00004 and 0.9, respectively;
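The step schedule of S306 reduces to a one-line function (weight decay 0.00004 and momentum 0.9 belong to the optimizer configuration and are not shown here):

```python
def learning_rate(iteration):
    """Step schedule from the training strategy: 1e-3 for the first 60K
    iterations, 1e-4 afterwards."""
    return 1e-3 if iteration < 60_000 else 1e-4
```

Dropping the rate by a factor of ten once the loss plateaus is the standard fine-tuning step for this kind of detector training.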
S4: detecting the remote sensing images with the trained target detection model, and analysing and comparing the detection results; duplicates in the obtained detection results are removed by non-maximum suppression (NMS) with the IoU threshold of NMS set to 0.7, and mAP, computed at an IoU threshold of 0.5, is selected as the evaluation metric for measuring the remote sensing image target detection performance.
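The de-duplication in S4 is ordinary greedy NMS; a self-contained NumPy version at the stated 0.7 IoU threshold:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-maximum suppression: keep the highest-scoring box, drop
    every remaining box that overlaps it above iou_thresh, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [0.5, 0.5, 10.5, 10.5], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)  # the middle box overlaps the first above 0.7 IoU
```

The same `iou` helper, at the looser 0.5 threshold, is what decides true versus false positives when the mAP metric is computed.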
CN201911271302.1A 2019-12-12 2019-12-12 Optical remote sensing image target detection method based on void feature pyramid network Active CN111126202B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911271302.1A CN111126202B (en) 2019-12-12 2019-12-12 Optical remote sensing image target detection method based on void feature pyramid network

Publications (2)

Publication Number Publication Date
CN111126202A CN111126202A (en) 2020-05-08
CN111126202B true CN111126202B (en) 2022-03-04

Family

ID=70499561






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant