CN116883859A - Remote sensing image target detection method based on YOLOv7-RS - Google Patents

Remote sensing image target detection method based on YOLOv7-RS Download PDF

Info

Publication number
CN116883859A
Authority
CN
China
Prior art keywords
remote sensing image
loss
yolov7
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310818961.2A
Other languages
Chinese (zh)
Inventor
梁琦
曹亚明
杨晓文
薛红新
贾彩琴
郭磊
孙福盛
焦世超
赵融
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North University of China
Original Assignee
North University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North University of China
Priority to CN202310818961.2A
Publication of CN116883859A
Legal status: Pending

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision, and particularly relates to a remote sensing image target detection method based on YOLOv7-RS. To improve the accuracy of target detection in remote sensing images, the invention designs a remote sensing image target detection network based on YOLOv7-RS, in which a D-ELAN module is redesigned, a SimAM attention mechanism is fused into the backbone network, the CIOU loss function is replaced with the SIOU loss function, and the positive and negative sample allocation strategy is optimized.

Description

Remote sensing image target detection method based on YOLOv7-RS
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a remote sensing image target detection method based on YOLOv7-RS.
Background
With the continuous development of remote sensing technology, remote sensing image target detection has become an important research direction in the field of remote sensing image interpretation. Target detection in remote sensing images is of great significance in fields such as military operations and national defense security: improving detection accuracy makes it possible to quickly pick out the target information of interest from large volumes of image data, strengthening intelligence-gathering capability.
In recent years, the rapid development of deep learning has provided strong technical support for feature extraction from remote sensing images. Most deep-learning-based target detection methods use a convolutional neural network as the backbone, because convolutional neural networks automatically extract high-level semantic features and therefore offer stronger feature representation than traditional hand-crafted feature extraction; their ability to learn features actively is also a major advantage in the big-data era. The rapid development of convolutional neural networks has solved many problems in the field of computer vision and achieved great success in image target detection. However, because remote sensing targets exhibit multiple scales, arbitrary rotation angles, complex scenes, and similar characteristics, and high-quality labeled samples are limited, deep learning still faces great challenges in remote sensing image target detection.
Current deep-learning-based target detection algorithms are mainly divided into two-stage and single-stage detectors. The YOLO series is a typical family of single-stage detectors. YOLOv1, first proposed in 2015, effectively addressed the slow network inference of two-stage detection. YOLOv2 improved on it in three respects (faster, more classes, more accurate) and expanded the set of recognizable objects to 9000 categories, hence its alternative name YOLO9000. YOLOv3 introduced the feature pyramid network (FPN) and the residual backbone Darknet-53, supporting detection at three different scales and realizing multi-scale fusion. YOLOv4 and YOLOv5 combined popular techniques such as Weighted Residual Connections (WRC), Cross-Stage Partial connections (CSP), and Mosaic data augmentation to further improve detection accuracy and speed. YOLOX adopted an anchor-free design and replaced the coupled detection head of YOLOv5 with a decoupled head, improving the network's convergence speed, and proposed the positive/negative sample matching strategy SimOTA on the basis of OTA. YOLOv6, a target detection framework developed and optimized by Meituan's vision intelligence team, is widely used in industry. YOLOv7, released in July 2022, proposed the E-ELAN architecture and auxiliary training modules for network performance, further improving the speed and accuracy of the algorithm.
Existing research has produced many achievements and advances, but problems remain that require further study: YOLO-based target detection algorithms perform well on natural images, but remote sensing images are formed by a different imaging process and exhibit complex, varied backgrounds and large scale differences, so YOLO-based target detection on remote sensing images performs poorly.
Disclosure of Invention
The invention aims to solve the problem of poor detection accuracy for remote sensing image targets. It provides a remote sensing image target detection method based on YOLOv7-RS and designs a YOLOv7-RS remote sensing image target detection network that can efficiently address the problems present in remote sensing image target detection.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a remote sensing image target detection method based on YOLOv7-RS comprises the following steps:
step 1, acquiring a remote sensing image and preprocessing the remote sensing image;
step 2, constructing a remote sensing image target detection model based on a YOLOv7-RS network structure;
and step 3, inputting the preprocessed remote sensing image and a weight file into the constructed model and detecting targets in the remote sensing image.
Further, in the step 1, the remote sensing image is preprocessed, specifically: the acquired remote sensing image is scaled to 640x640, and any shortfall is completed by pixel filling.
Further, the YOLOv7-RS network structure in the step 2 includes a D-ELAN module and a SIOU loss function part, and comprises an Input stage, a Backbone network (Backbone) stage, a Neck network (Neck) stage, and a Head network (Head) stage;
the D-ELAN module is designed according to the split gradient flow idea of CSPNet: the first branch directly passes through a 1x1 convolution; the second branch, on that basis, passes through three groups of two 3x3 convolutions; finally, the output of the 1x1 convolution and the outputs of the three groups of 3x3 convolutions are concatenated, so that the feature extraction capability is improved by raising block utilization and increasing network depth.
Still further, in the backbone network stage, a three-dimensional attention module SimAM which fuses channel attention and spatial attention is introduced, with the calculation formula:

$$\tilde{X} = \operatorname{sigmoid}\left(\frac{1}{E}\right) \odot X \quad (1)$$

where $\tilde{X}$ is the output feature, $X$ is the input feature, and $E$ groups the minimal energy functions $e_t^{*}$ of all neurons across the channel and spatial dimensions; the minimal energy function $e_t^{*}$ of an individual neuron is shown in formula (2):

$$e_t^{*} = \frac{4\left(\hat{\sigma}^{2} + \lambda\right)}{\left(t - \hat{\mu}\right)^{2} + 2\hat{\sigma}^{2} + 2\lambda} \quad (2)$$

where $t$ is the target neuron, $\lambda$ is a hyperparameter, $\hat{\mu}$ is the mean of all neurons on a single channel, and $\hat{\sigma}^{2}$ is the variance of all neurons on a single channel, as shown in formulas (3) and (4):

$$\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} X_i \quad (3)$$

$$\hat{\sigma}^{2} = \frac{1}{M}\sum_{i=1}^{M}\left(X_i - \hat{\mu}\right)^{2} \quad (4)$$

where $M$ represents the number of neurons on each channel and $X_i$ represents the $i$-th neuron of the input feature map on a single channel.
Still further, in the SIOU loss function, the angle deviation between the real frame and the predicted frame is defined as an angle loss, and the calculation of a distance loss is added; that is, the SIOU loss function consists of four parts, the angle loss (Angle cost), the distance loss (Distance cost), the shape loss (Shape cost), and the IOU loss, calculated as follows:

$$L_{SIOU} = 1 - IOU + \frac{\Delta + \Omega}{2} \quad (5)$$

where $IOU$ is the IOU loss, $\Delta$ is the distance loss, and $\Omega$ is the shape loss; the three are calculated as:

$$\Delta = \sum_{t=x,y}\left(1 - e^{-\gamma\rho_t}\right), \qquad \gamma = 2 - \Lambda \quad (6)$$

$$\Lambda = 1 - 2\sin^{2}\left(\arcsin\frac{C_h}{\sigma} - \frac{\pi}{4}\right) \quad (7)$$

$$\rho_x = \left(\frac{x_{gt} - x}{C_w}\right)^{2}, \qquad \rho_y = \left(\frac{y_{gt} - y}{C_h}\right)^{2} \quad (8)$$

$$\Omega = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta}, \qquad \omega_w = \frac{\lvert w - w_{gt}\rvert}{\max(w, w_{gt})}, \qquad \omega_h = \frac{\lvert h - h_{gt}\rvert}{\max(h, h_{gt})} \quad (9)$$

where $\Lambda$ is the angle loss; $C_h$ and $C_w$ are the height and width of the minimum enclosing rectangle of the real frame and the predicted frame; $\gamma = 2 - \Lambda$ is the coefficient that shifts weight between the angle and distance terms; $\rho_x$ is the proportion that the difference between the center abscissas of the real frame and the predicted frame occupies in $C_w$, and $\rho_y$ the proportion that the difference of the center ordinates occupies in $C_h$; $x_{gt}$ and $y_{gt}$ are the abscissa and ordinate of the center point of the real frame, and $x$ and $y$ those of the predicted frame; $\sigma$ is the distance between the center points of the real frame and the predicted frame; $w_{gt}$ and $h_{gt}$ are the width and height of the real frame, and $w$ and $h$ those of the predicted frame; $\omega_w$ is the proportion of the difference between the widths of the real frame and the predicted frame in the larger of the two, and $\omega_h$ the corresponding proportion for the heights; $\theta$ controls the degree of attention paid to the shape loss.
Still further, only positive samples participate in the calculation of the SIOU loss function.
Furthermore, the positive and negative sample allocation strategy is optimized for the SIOU loss: on the basis of the positive and negative sample allocation strategy of YOLOv7, the rotation invariance of remote sensing image targets is comprehensively considered, and the number of positive-sample candidate boxes is increased from three to four.
Compared with the prior art, the invention has the following advantages:
(1) To remedy the insufficient capability of the YOLOv7 network to extract remote sensing image features, the ELAN module is redesigned as the D-ELAN module.
(2) To reduce the interference of background noise in remote sensing images, the SimAM attention mechanism is fused into the YOLOv7 network, so that the network attends to the more valuable information in the image.
(3) To increase the convergence speed of the network, the CIOU loss function is replaced with the SIOU loss function.
(4) To solve the problem of missed detection when small targets are densely arranged in remote sensing images, the positive and negative sample allocation strategy is optimized.
(5) The YOLOv7-RS provided by the invention outperforms most existing methods and shows competitive detection capability on the NWPU VHR-10 and DOTA datasets; it adapts well to the complexity and diversity of remote sensing images, demonstrating the effectiveness of the method.
Drawings
FIG. 1 is a flow chart of a remote sensing image target detection method based on YOLOv 7-RS;
FIG. 2 is a network structure diagram of a remote sensing image target detection method based on YOLOv7-RS of the invention;
FIG. 3 is a block-diagram comparison of the D-ELAN module of the invention and the ELAN module of YOLOv7;
FIG. 4 is a diagram showing parameters of a real frame and a predicted frame in a SIOU according to the present invention;
FIG. 5 is a schematic diagram of positive and negative sample strategy optimization;
FIG. 6 is a comparison of visualization results on the NWPU VHR-10 dataset;
FIG. 7 is a comparison of visualization results on the DOTA dataset.
Detailed Description
The present invention will be described more fully hereinafter in order to facilitate an understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Example 1
As shown in FIG. 1, the method for detecting the target of the remote sensing image based on the YOLOv7-RS comprises the following steps:
step 1, acquiring a remote sensing image and preprocessing the remote sensing image, namely scaling the image to 640x640;
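The patent gives no code for this step; as an illustrative sketch (not part of the claimed subject matter), the scale-and-pad preprocessing could look as follows in Python, assuming OpenCV-style image arrays and a gray padding value of 114, both assumptions not specified by the invention:

```python
import cv2
import numpy as np

def letterbox(image: np.ndarray, size: int = 640, pad_value: int = 114) -> np.ndarray:
    """Scale the longer side to `size`, then complete the short side by pixel filling."""
    h, w = image.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(image, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.full((size, size, 3), pad_value, dtype=image.dtype)  # pixel filling
    top = (size - resized.shape[0]) // 2
    left = (size - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas
```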
step 2, constructing a remote sensing image target detection model based on the YOLOv7-RS network structure (shown in FIG. 2), where the YOLOv7-RS network structure includes a D-ELAN module and a SIOU loss function part, and comprises an input stage, a backbone network stage, a neck network stage, and a head network stage;
wherein the D-ELAN (Deep-ELAN) module is designed according to the split gradient flow idea of CSPNet: the first branch directly passes through a 1x1 convolution; the second branch, on that basis, passes through three groups of two 3x3 convolutions; finally, the output of the 1x1 convolution and the outputs of the three groups of 3x3 convolutions are concatenated, so that the feature extraction capability is improved by raising block utilization and increasing network depth. A block-diagram comparison of the D-ELAN module and the ELAN module is shown in FIG. 3.
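A possible PyTorch rendering of this block is sketched below; the channel widths, the BN+SiLU convolution unit, the final fusing 1x1 convolution, and the use of separate initial 1x1 convolutions for the two branches are our assumptions, and FIG. 3 remains the authoritative structure:

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Convolution + BatchNorm + SiLU, the usual YOLOv7 building block."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class DELAN(nn.Module):
    """Branch 1: a single 1x1 conv. Branch 2: a 1x1 conv followed by three
    groups of two 3x3 convs. The 1x1 output and the three group outputs
    are concatenated, then fused by a final 1x1 conv."""
    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        self.cv1 = Conv(c_in, c_mid, 1)   # branch 1
        self.cv2 = Conv(c_in, c_mid, 1)   # entry of branch 2
        self.groups = nn.ModuleList(
            nn.Sequential(Conv(c_mid, c_mid, 3), Conv(c_mid, c_mid, 3))
            for _ in range(3)
        )
        self.fuse = Conv(4 * c_mid, c_out, 1)

    def forward(self, x):
        outs = [self.cv1(x)]
        y = self.cv2(x)
        for g in self.groups:
            y = g(y)
            outs.append(y)   # keep each group's output for the concatenation
        return self.fuse(torch.cat(outs, dim=1))
```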
In the backbone network stage, a three-dimensional attention module SimAM which fuses channel attention and spatial attention is introduced, with the calculation formula:

$$\tilde{X} = \operatorname{sigmoid}\left(\frac{1}{E}\right) \odot X \quad (1)$$

where $\tilde{X}$ is the output feature, $X$ is the input feature, and $E$ groups the minimal energy functions $e_t^{*}$ of all neurons across the channel and spatial dimensions; the minimal energy function $e_t^{*}$ of an individual neuron is shown in formula (2):

$$e_t^{*} = \frac{4\left(\hat{\sigma}^{2} + \lambda\right)}{\left(t - \hat{\mu}\right)^{2} + 2\hat{\sigma}^{2} + 2\lambda} \quad (2)$$

where $t$ is the target neuron, $\lambda$ is a hyperparameter, $\hat{\mu}$ is the mean of all neurons on a single channel, and $\hat{\sigma}^{2}$ is the variance of all neurons on a single channel, as shown in formulas (3) and (4):

$$\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} X_i \quad (3)$$

$$\hat{\sigma}^{2} = \frac{1}{M}\sum_{i=1}^{M}\left(X_i - \hat{\mu}\right)^{2} \quad (4)$$

where $M$ represents the number of neurons on each channel and $X_i$ represents the $i$-th neuron of the input feature map on a single channel.
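For reference, formulas (1)-(4) can be realized as a parameter-free PyTorch module; the sketch below follows the published SimAM implementation, where λ = 1e-4 is a commonly used default (an assumption here) and the M−1 denominator mirrors that implementation:

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free 3-D attention: weight every neuron by the inverse of
    its minimal energy e_t* (formula (2)) passed through a sigmoid."""
    def __init__(self, lam: float = 1e-4):
        super().__init__()
        self.lam = lam

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w - 1                                  # neurons per channel, minus the target
        mu = x.mean(dim=(2, 3), keepdim=True)          # formula (3)
        d = (x - mu) ** 2
        var = d.sum(dim=(2, 3), keepdim=True) / n      # formula (4)
        inv_energy = d / (4 * (var + self.lam)) + 0.5  # proportional to 1/e_t*
        return x * torch.sigmoid(inv_energy)           # formula (1)
```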
In the SIOU loss function, the angle deviation between the real frame and the predicted frame is defined as an angle loss, and the calculation of a distance loss is added; that is, the SIOU loss function consists of four parts, the angle loss, the distance loss, the shape loss, and the IOU loss, calculated as follows:

$$L_{SIOU} = 1 - IOU + \frac{\Delta + \Omega}{2} \quad (5)$$

where $IOU$ is the IOU loss, $\Delta$ is the distance loss, and $\Omega$ is the shape loss; the three are calculated as:

$$\Delta = \sum_{t=x,y}\left(1 - e^{-\gamma\rho_t}\right), \qquad \gamma = 2 - \Lambda \quad (6)$$

$$\Lambda = 1 - 2\sin^{2}\left(\arcsin\frac{C_h}{\sigma} - \frac{\pi}{4}\right) \quad (7)$$

$$\rho_x = \left(\frac{x_{gt} - x}{C_w}\right)^{2}, \qquad \rho_y = \left(\frac{y_{gt} - y}{C_h}\right)^{2} \quad (8)$$

$$\Omega = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta}, \qquad \omega_w = \frac{\lvert w - w_{gt}\rvert}{\max(w, w_{gt})}, \qquad \omega_h = \frac{\lvert h - h_{gt}\rvert}{\max(h, h_{gt})} \quad (9)$$

where $\Lambda$ is the angle loss; $C_h$ and $C_w$ are the height and width of the minimum enclosing rectangle of the real frame and the predicted frame; $\gamma = 2 - \Lambda$ is the coefficient that shifts weight between the angle and distance terms; $\rho_x$ is the proportion that the difference between the center abscissas of the real frame and the predicted frame occupies in $C_w$, and $\rho_y$ the proportion that the difference of the center ordinates occupies in $C_h$; $x_{gt}$ and $y_{gt}$ are the abscissa and ordinate of the center point of the real frame, and $x$ and $y$ those of the predicted frame; $\sigma$ is the distance between the center points of the real frame and the predicted frame; $w_{gt}$ and $h_{gt}$ are the width and height of the real frame, and $w$ and $h$ those of the predicted frame; $\omega_w$ is the proportion of the difference between the widths of the real frame and the predicted frame in the larger of the two, and $\omega_h$ the corresponding proportion for the heights; $\theta$ controls the degree of attention paid to the shape loss, with a parameter range of [2,6]. A schematic of these parameters is shown in fig. 4, where the lower-left box is the predicted frame and the upper-right box is the real frame.
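Under these definitions, the SIOU loss for axis-aligned (cx, cy, w, h) boxes could be sketched as follows; the epsilon guards and the choice θ = 4 within the stated [2,6] range are our assumptions:

```python
import math
import torch

def siou_loss(pred: torch.Tensor, target: torch.Tensor, theta: float = 4.0, eps: float = 1e-7):
    """pred, target: (N, 4) tensors of (cx, cy, w, h) boxes.
    Returns the per-box loss 1 - IOU + (distance + shape) / 2, formula (5)."""
    px, py, pw, ph = pred.unbind(-1)
    gx, gy, gw, gh = target.unbind(-1)

    # IOU term
    inter_w = (torch.min(px + pw / 2, gx + gw / 2) - torch.max(px - pw / 2, gx - gw / 2)).clamp(min=0)
    inter_h = (torch.min(py + ph / 2, gy + gh / 2) - torch.max(py - ph / 2, gy - gh / 2)).clamp(min=0)
    inter = inter_w * inter_h
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # width/height of the minimum enclosing rectangle
    cw = torch.max(px + pw / 2, gx + gw / 2) - torch.min(px - pw / 2, gx - gw / 2)
    ch = torch.max(py + ph / 2, gy + gh / 2) - torch.min(py - ph / 2, gy - gh / 2)

    # angle cost, formula (7)
    sigma = torch.sqrt((gx - px) ** 2 + (gy - py) ** 2) + eps
    sin_alpha = ((gy - py).abs() / sigma).clamp(-1, 1)
    angle = 1 - 2 * torch.sin(torch.arcsin(sin_alpha) - math.pi / 4) ** 2

    # distance cost, formulas (6) and (8)
    gamma = 2 - angle
    rho_x = ((gx - px) / (cw + eps)) ** 2
    rho_y = ((gy - py) / (ch + eps)) ** 2
    dist = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # shape cost, formula (9)
    omega_w = (pw - gw).abs() / (torch.max(pw, gw) + eps)
    omega_h = (ph - gh).abs() / (torch.max(ph, gh) + eps)
    shape = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    return 1 - iou + (dist + shape) / 2
```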
Only positive samples participate in the calculation of the SIOU loss function, and the problem of missed detection for densely arranged small targets in remote sensing images is addressed by optimizing the positive and negative sample allocation strategy: on the basis of the YOLOv7 allocation strategy, the rotation invariance of remote sensing targets is comprehensively considered, and the number of positive-sample candidate boxes is increased from three to four. As shown in fig. 5, under a 45-degree rotation of the remote sensing image, this reduces the positive-sample loss rate from 46% to 28%; a sketch of the candidate selection follows below.
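The selection rule sketched here (the containing grid cell plus the k nearest axis neighbors, ranked by the center offsets) is our reading of the YOLOv5/YOLOv7-style assignment, not a verbatim reproduction of the invention's rule:

```python
def candidate_cells(cx: float, cy: float, k_neighbors: int = 3):
    """Positive-sample candidate grid cells for a ground-truth center
    (cx, cy) given in grid units: the containing cell plus the k nearest
    of its four axis neighbors. k_neighbors=2 reproduces YOLOv7's three
    candidates; k_neighbors=3 yields the four candidates used here."""
    gi, gj = int(cx), int(cy)
    fx, fy = cx - gi, cy - gj          # fractional offsets inside the cell
    neighbors = {
        (gi - 1, gj): fx,              # distance toward the left neighbor
        (gi + 1, gj): 1 - fx,          # toward the right neighbor
        (gi, gj - 1): fy,              # toward the upper neighbor
        (gi, gj + 1): 1 - fy,          # toward the lower neighbor
    }
    nearest = sorted(neighbors, key=neighbors.get)[:k_neighbors]
    return [(gi, gj)] + nearest
```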
And step 3, inputting the preprocessed remote sensing image and the weight file into the constructed model and detecting targets in the remote sensing image; the weight file in this embodiment is the optimal weight file obtained by iterative training for 300 epochs on the NWPU VHR-10 and DOTA datasets.
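Step 3 then amounts to loading the trained weights and running a forward pass; in sketch form (the model object and file paths are placeholders, `letterbox` is the preprocessing sketch from step 1, and post-processing such as NMS depends on the concrete codebase):

```python
import cv2
import torch

def detect(model: torch.nn.Module, weight_path: str, image_path: str):
    """Load trained weights, preprocess one image as in step 1, and run
    the detector; NMS and score thresholds are omitted here."""
    model.load_state_dict(torch.load(weight_path, map_location="cpu"))
    model.eval()
    img = letterbox(cv2.imread(image_path))   # 640x640 scale-and-pad from step 1
    x = torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        return model(x)                       # raw head outputs, pre-NMS
```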
Example 2
The remote sensing image target detection network model provided by the invention is applied to the NWPU VHR-10 data set, and experiments show that the network model is effective.
The NWPU VHR-10 dataset was published by Northwestern Polytechnical University in 2014, with images extracted from Google Earth and the Vaihingen dataset. It covers 10 categories, airplane (PL), ship (SH), storage tank (ST), baseball diamond (BD), tennis court (TC), basketball court (BC), ground track field (GTF), harbor (HA), bridge (BR), and vehicle (VE), in 800 remote sensing images (including 150 background images). The annotations use the HBB (Horizontal Bounding Boxes) format, for a total of 3651 instances. 90% of the dataset was randomly split off as the training set and 10% as the test set.
mAP (mean Average Precision) is commonly used to measure the overall performance of a model on a target detection task; mAP is the mean of the average precision AP (Average Precision) over the categories in the dataset. For each category, a Precision-Recall curve can be drawn over the coordinate range from 0 to 1, and the area enclosed by the curve and the coordinate axes is the average precision, as shown in formula (10):

$$AP = \int_{0}^{1} P(R)\, dR \quad (10)$$

where Precision is the proportion of true positives among the samples the detector predicts as positive, as shown in formula (11), and Recall is the proportion of correctly predicted positive samples among all positive samples, as shown in formula (12):

$$Precision = \frac{TP}{TP + FP} \quad (11)$$

$$Recall = \frac{TP}{TP + FN} \quad (12)$$

where TP denotes true positives, FN denotes false negatives, and FP denotes false positives.
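A compact all-point AP computation matching formulas (10)-(12) is sketched below; a full evaluation additionally requires IOU matching between predictions and ground-truth boxes, which is omitted here:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """scores: confidence of each prediction; is_tp: 1 if the prediction was
    matched to a ground-truth box, else 0; num_gt: number of ground truths.
    Returns the area under the precision-recall curve, formula (10)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)                        # formula (12)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)  # formula (11)
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]          # precision envelope
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```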
The YOLOv7-RS proposed by the invention was experimentally compared with the SSD, Faster R-CNN, YOLOv3, YOLOv4, YOLOv5s, and YOLOv7 algorithms on the NWPU VHR-10 dataset; the results are shown in Table 1.
TABLE 1 experimental results of different algorithms on NWPU VHR-10 dataset
As shown in Table 1, YOLOv7-RS improves mAP by 14.3%, 10.9%, 20.3%, 6.3%, 5.3%, and 2.6% over SSD, Faster R-CNN, YOLOv3, YOLOv4, YOLOv5s, and YOLOv7, respectively. The detection accuracy of YOLOv7-RS is above 89% in every category, indicating good overall accuracy; for airplane (PL) and storage tank (ST) targets its accuracy is the best among the compared algorithms, reaching 99.6%. Compared with the original YOLOv7, detection accuracy improves for airplane (PL), storage tank (ST), tennis court (TC), basketball court (BC), and vehicle (VE).
Through a large number of experiments, the detection results of YOLOv7 and YOLOv7-RS were compared, and two groups of visualization results were selected for analysis, as shown in FIG. 6; the left side shows the detection results of the YOLOv7 algorithm and the right side those of the YOLOv7-RS algorithm.
In FIG. 6(a), YOLOv7 falsely detects a yellow landmark as an airplane, and in FIG. 6(b) it misses a bridge. YOLOv7-RS detects these targets accurately, effectively improving the detection effect under complex backgrounds.
Example 3
The remote sensing image target detection network model provided by the invention is applied to the DOTA dataset, and experiments demonstrate the effectiveness of the network model.
The DOTA v1.0 dataset comes from Google Earth, GF-2 and JL-1 satellite images provided by the China Centre for Resources Satellite Data and Application, and aerial images provided by CycloMedia B.V. It covers 15 categories, plane (PL), ship (SH), small vehicle (SV), large vehicle (LV), storage tank (ST), tennis court (TC), ground track field (GTF), bridge (BR), roundabout (RA), swimming pool (SP), baseball diamond (BD), basketball court (BC), harbor (HA), helicopter (HC), and soccer ball field (SBF), in 2806 aerial images from different sensors and platforms, with image sizes ranging from 800x800 to 4000x4000, for a total of 188282 instances. In this embodiment, DOTA_devkit is used to preprocess the HBB-annotated dataset: the original images are cropped into 1024x1024 sub-images with a 200-pixel overlap, and cropped images that do not reach the specified size are completed by pixel filling. After processing, the training set contains 15749 images and the test set contains 5297 images.
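The cropping described here might be implemented along the following lines; this is a sketch of the sliding-window split (DOTA_devkit's own split tooling is the authoritative implementation, and the gray padding value is an assumption):

```python
import numpy as np

def split_image(image: np.ndarray, tile: int = 1024, overlap: int = 200, pad_value: int = 114):
    """Cut an image into tile x tile crops sharing `overlap` pixels;
    border crops that fall short of the tile size are completed by
    pixel filling. Returns a list of ((top, left), crop) pairs."""
    stride = tile - overlap
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, max(h - overlap, 1), stride):
        for left in range(0, max(w - overlap, 1), stride):
            crop = image[top:top + tile, left:left + tile]
            if crop.shape[:2] != (tile, tile):   # border tile: pad to full size
                canvas = np.full((tile, tile, 3), pad_value, dtype=image.dtype)
                canvas[:crop.shape[0], :crop.shape[1]] = crop
                crop = canvas
            tiles.append(((top, left), crop))
    return tiles
```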
The results of the experimental comparison of the proposed YOLOv7-RS with the SSD, Faster R-CNN, YOLOv3, YOLOv4, YOLOv5s, and YOLOv7 algorithms on the DOTA dataset are shown in Table 2.
Table 2 comparison of results of different models on DOTA dataset
As can be seen from Table 2, YOLOv7-RS improves mAP by 21.7%, 32.1%, 9.6%, 5.7%, 4.6%, and 2.4% over SSD, Faster R-CNN, YOLOv3, YOLOv4, YOLOv5s, and YOLOv7, respectively. Its detection accuracy on baseball diamond (BD), bridge (BR), large vehicle (LV), soccer ball field (SBF), and roundabout (RA) is the best among the compared algorithms. Compared with the original YOLOv7, accuracy drops by 0.1-0.2% in the three categories of tennis court (TC), basketball court (BC), and storage tank (ST), while the remaining categories improve markedly.
Through a large number of experiments, the detection results of YOLOv7 and YOLOv7-RS were compared, and two groups of visualization results were selected for analysis, as shown in FIG. 7; the left side shows the detection results of the YOLOv7 algorithm and the right side those of the YOLOv7-RS algorithm.
In FIG. 7(a), YOLOv7 detects 5 harbors; in FIG. 7(b), it detects 153 small vehicles and 3 large vehicles. YOLOv7-RS detects 5 harbors and 6 small vehicles in FIG. 7(a), and 272 small vehicles and 4 large vehicles in FIG. 7(b); YOLOv7-RS thus effectively alleviates missed detection under complex backgrounds and dense arrangements of small targets.
In conclusion, the YOLOv7-RS proposed by the invention outperforms most existing methods, reaching 95.4% and 74.1% mAP on the NWPU VHR-10 and DOTA datasets respectively; it adapts well to the complexity and diversity of remote sensing images, demonstrating the effectiveness of the method.

Claims (7)

1. A remote sensing image target detection method based on YOLOv7-RS is characterized by comprising the following steps:
step 1, acquiring a remote sensing image and preprocessing the remote sensing image;
step 2, constructing a remote sensing image target detection model based on a YOLOv7-RS network structure;
and step 3, inputting the preprocessed remote sensing image and the weight file into the constructed model, and detecting the target of the remote sensing image.
2. The YOLOv7-RS-based remote sensing image target detection method according to claim 1, wherein the preprocessing of the remote sensing image in step 1 is specifically: scaling the acquired remote sensing image to 640x640 and completing any shortfall by pixel filling.
3. The method for detecting the target of the remote sensing image based on YOLOv7-RS according to claim 1, wherein the YOLOv7-RS network structure in step 2 comprises a D-ELAN module, a SIOU loss function part, an input stage, a backbone network stage, a neck network stage, and a head network stage;
the first branch of the D-ELAN module directly passes through a 1x1 convolution; the second branch, on that basis, passes through three groups of two 3x3 convolutions; finally, the output of the 1x1 convolution and the outputs of the three groups of 3x3 convolutions are concatenated, so that the feature extraction capability is improved by raising block utilization and increasing network depth.
4. A remote sensing image target detection method based on YOLOv7-RS according to claim 3, wherein in the backbone network stage a three-dimensional attention module SimAM which fuses channel attention and spatial attention is introduced, with the calculation formula:

$$\tilde{X} = \operatorname{sigmoid}\left(\frac{1}{E}\right) \odot X \quad (1)$$

where $\tilde{X}$ is the output feature, $X$ is the input feature, and $E$ groups the minimal energy functions $e_t^{*}$ of all neurons across the channel and spatial dimensions; the minimal energy function $e_t^{*}$ of an individual neuron is shown in formula (2):

$$e_t^{*} = \frac{4\left(\hat{\sigma}^{2} + \lambda\right)}{\left(t - \hat{\mu}\right)^{2} + 2\hat{\sigma}^{2} + 2\lambda} \quad (2)$$

where $t$ is the target neuron, $\lambda$ is a hyperparameter, $\hat{\mu}$ is the mean of all neurons on a single channel, and $\hat{\sigma}^{2}$ is the variance of all neurons on a single channel, as shown in formulas (3) and (4):

$$\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} X_i \quad (3)$$

$$\hat{\sigma}^{2} = \frac{1}{M}\sum_{i=1}^{M}\left(X_i - \hat{\mu}\right)^{2} \quad (4)$$

where $M$ represents the number of neurons on each channel and $X_i$ represents the $i$-th neuron of the input feature map on a single channel.
5. A remote sensing image target detection method based on YOLOv7-RS according to claim 3, wherein in the SIOU loss function the angle deviation between the real frame and the predicted frame is defined as an angle loss and the calculation of a distance loss is added, that is, the SIOU loss function consists of four parts, the angle loss, the distance loss, the shape loss, and the IOU loss, calculated as follows:

$$L_{SIOU} = 1 - IOU + \frac{\Delta + \Omega}{2} \quad (5)$$

where $IOU$ is the IOU loss, $\Delta$ is the distance loss, and $\Omega$ is the shape loss; the three are calculated as:

$$\Delta = \sum_{t=x,y}\left(1 - e^{-\gamma\rho_t}\right), \qquad \gamma = 2 - \Lambda \quad (6)$$

$$\Lambda = 1 - 2\sin^{2}\left(\arcsin\frac{C_h}{\sigma} - \frac{\pi}{4}\right) \quad (7)$$

$$\rho_x = \left(\frac{x_{gt} - x}{C_w}\right)^{2}, \qquad \rho_y = \left(\frac{y_{gt} - y}{C_h}\right)^{2} \quad (8)$$

$$\Omega = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta}, \qquad \omega_w = \frac{\lvert w - w_{gt}\rvert}{\max(w, w_{gt})}, \qquad \omega_h = \frac{\lvert h - h_{gt}\rvert}{\max(h, h_{gt})} \quad (9)$$

where $\Lambda$ is the angle loss; $C_h$ and $C_w$ are the height and width of the minimum enclosing rectangle of the real frame and the predicted frame; $\gamma = 2 - \Lambda$ is the coefficient that shifts weight between the angle and distance terms; $\rho_x$ is the proportion that the difference between the center abscissas of the real frame and the predicted frame occupies in $C_w$, and $\rho_y$ the proportion that the difference of the center ordinates occupies in $C_h$; $x_{gt}$ and $y_{gt}$ are the abscissa and ordinate of the center point of the real frame, and $x$ and $y$ those of the predicted frame; $\sigma$ is the distance between the center points of the real frame and the predicted frame; $w_{gt}$ and $h_{gt}$ are the width and height of the real frame, and $w$ and $h$ those of the predicted frame; $\omega_w$ is the proportion of the difference between the widths of the real frame and the predicted frame in the larger of the two, and $\omega_h$ the corresponding proportion for the heights; $\theta$ controls the degree of attention paid to the shape loss.
6. A YOLOv7-RS-based remote sensing image target detection method according to claim 3, wherein only positive samples participate in the calculation of the SIOU loss function.
7. The YOLOv7-RS-based remote sensing image target detection method according to claim 6, wherein the positive and negative sample allocation strategy is optimized for the SIOU loss: on the basis of the positive and negative sample allocation strategy of YOLOv7, the rotation invariance of remote sensing image targets is comprehensively considered, and the number of positive-sample candidate boxes is increased from three to four.
CN202310818961.2A 2023-07-05 2023-07-05 Remote sensing image target detection method based on YOLOv7-RS Pending CN116883859A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310818961.2A CN116883859A (en) 2023-07-05 2023-07-05 Remote sensing image target detection method based on YOLOv7-RS

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310818961.2A CN116883859A (en) 2023-07-05 2023-07-05 Remote sensing image target detection method based on YOLOv7-RS

Publications (1)

Publication Number Publication Date
CN116883859A true CN116883859A (en) 2023-10-13

Family

ID=88259690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310818961.2A Pending CN116883859A (en) 2023-07-05 2023-07-05 Remote sensing image target detection method based on YOLOv7-RS

Country Status (1)

Country Link
CN (1) CN116883859A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117611998A (en) * 2023-11-22 2024-02-27 盐城工学院 Optical remote sensing image target detection method based on improved YOLOv7

Similar Documents

Publication Publication Date Title
CN110443208A (en) A kind of vehicle target detection method, system and equipment based on YOLOv2
CN110175576A (en) A kind of driving vehicle visible detection method of combination laser point cloud data
CN111914795A (en) Method for detecting rotating target in aerial image
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
CN113807464B (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLO V5
CN109034035A (en) Pedestrian's recognition methods again based on conspicuousness detection and Fusion Features
CN113160246A (en) Image semantic segmentation method based on depth supervision
Zheng et al. A review of remote sensing image object detection algorithms based on deep learning
CN116883859A (en) Remote sensing image target detection method based on YOLOv7-RS
Han et al. Research on remote sensing image target recognition based on deep convolution neural network
Liu et al. CAFFNet: channel attention and feature fusion network for multi-target traffic sign detection
CN115937552A (en) Image matching method based on fusion of manual features and depth features
CN112529065A (en) Target detection method based on feature alignment and key point auxiliary excitation
Chen et al. Object detection of optical remote sensing image based on improved faster RCNN
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN112102241A (en) Single-stage remote sensing image target detection algorithm
CN116721398A (en) Yolov5 target detection method based on cross-stage route attention module and residual information fusion module
CN116385876A (en) Optical remote sensing image ground object detection method based on YOLOX
CN114494893B (en) Remote sensing image feature extraction method based on semantic reuse context feature pyramid
Zhao et al. Street-view Change Detection via Siamese Encoder-decoder Structured Convolutional Neural Networks.
CN115100516A (en) Relation learning-based remote sensing image target detection method
Schuegraf et al. Deep Learning for the Automatic Division of Building Constructions into Sections on Remote Sensing Images
CN116385477A (en) Tower image registration method based on image segmentation
CN115035429A (en) Aerial photography target detection method based on composite backbone network and multiple measuring heads
CN115100428A (en) Target detection method using context sensing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination