CN116343027A - YOLOv5 remote sensing image target detection method utilizing attention mechanism fusion - Google Patents

YOLOv5 remote sensing image target detection method utilizing attention mechanism fusion

Info

Publication number
CN116343027A
CN116343027A (application CN202310177081.1A)
Authority
CN
China
Prior art keywords
remote sensing
sensing image
yolov5
attention mechanism
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310177081.1A
Other languages
Chinese (zh)
Inventor
Wang Longbo
Liu Jianhui
Zhang Beibei
Jiang Gangwu
Ma Shunshun
Wei Xiangpo
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force filed Critical Information Engineering University of PLA Strategic Support Force
Priority to CN202310177081.1A priority Critical patent/CN116343027A/en
Publication of CN116343027A publication Critical patent/CN116343027A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a YOLOv5 remote sensing image target detection method using attention mechanism fusion, which comprises the following steps: step 1: constructing a remote sensing image target detection network, wherein an attention mechanism is fused at any one of the backbone layer, the neck layer and the output end of YOLOv5; fusing the attention mechanism in the backbone layer of YOLOv5 specifically comprises: adding an attention module after each CSP structure in the backbone layer; step 2: designing a loss function, and training the remote sensing image target detection network based on the loss function to obtain a remote sensing image target detection network model; step 3: inputting the remote sensing image to be detected into the trained target detection network model to obtain a detection result.

Description

YOLOv5 remote sensing image target detection method utilizing attention mechanism fusion
Technical Field
The invention relates to the technical field of image processing, in particular to a YOLOv5 remote sensing image target detection method using attention mechanism fusion.
Background
The detection and identification of targets in remote sensing images using target detection technology has become a current research hotspot. Under the challenge of massive data, it is increasingly difficult to satisfy both the accuracy and the timeliness of target detection. Traditional remote sensing image target detection algorithms depend on manual design, suffer from poor real-time performance and low detection precision, and struggle to meet practical application demands. Therefore, as research into deep learning has deepened, target detection technology has gradually shifted from conventional labor-intensive techniques to deep learning. In terms of processing flow, deep-learning-based target detection algorithms fall into two classes: two-stage and single-stage detection algorithms. A two-stage algorithm first generates regions to be detected and then performs detection and judgment on the targets within them; its detection precision is therefore high and it suits high-precision detection scenes, but excessive model parameters and a complex pipeline make its timeliness poor. Typical two-stage algorithms include R-CNN, Fast R-CNN and Faster R-CNN. A single-stage target detection algorithm completes the generation, classification and regression of the regions to be detected in one step, so it offers high real-time performance and suits scenes such as real-time target detection; representative algorithms include SSD and the YOLO series. YOLOv5 follows the design of YOLOv4 and is optimized with a lighter network design, an adaptive anchoring method and a GIoU loss function; it is currently a relatively mature single-stage detection algorithm that balances detection efficiency and accuracy. However, in existing detection tasks the algorithm still faces many problems. For example, the complex backgrounds, varied scales and mutual occlusion of remote sensing image targets greatly increase the difficulty of the detection task and limit the detection precision of the algorithm, which has led many researchers to improve the YOLOv5 algorithm.
For example, Document 1 (Tian Heng, Wang Ling, Wang Peng, et al. Research on target detection algorithm based on improved YOLOv5 [J]. Computer Engineering and Applications, 2022, 58(13): 63-73) proposes a lightweight improved model YOLOv-G, which improves the feature pyramid structure of YOLOv5 and integrates a parallel-mode attention mechanism into the backbone network, thereby improving detection performance. Document 2 (Zhao Rui, Liu Hui, Liu Peilin, et al. Safety helmet detection algorithm based on improved YOLOv5s [J/OL]. Journal of Beijing University of Aeronautics and Astronautics: 1-16 [2022-10-02]. DOI: 10.13700/j.bh.1001-5965.2021.0595) uses a DenseBlock module to replace the slice structure in the YOLOv5 backbone network and adds an SE-Net channel attention module at the neck, improving the algorithm's detection capability in densely distributed target scenes. Both works improve the YOLOv5 algorithm by adding attention mechanisms and effectively raise detection precision in some scenes, but they still struggle to meet the target detection field's demands for speed and accuracy; the inventor believes that the core problem of these documents is that they neglect the influence of the position in the network structure at which the attention mechanism is fused.
Disclosure of Invention
In order to further meet the high requirements of the target detection field for detection efficiency and detection accuracy, the invention provides a YOLOv5 remote sensing image target detection method using attention mechanism fusion.
The YOLOv5 remote sensing image target detection method using attention mechanism fusion provided by the invention comprises the following steps:
step 1: constructing a remote sensing image target detection network, wherein an attention mechanism is fused at any one of the backbone layer, the neck layer and the output end of YOLOv5; fusing the attention mechanism in the backbone layer of YOLOv5 specifically comprises: adding an attention module after each CSP structure in the backbone layer;
step 2: designing a loss function, and training the remote sensing image target detection network based on the loss function to obtain a remote sensing image target detection network model;
step 3: and inputting the remote sensing image to be detected into the trained target detection network model to obtain a detection result.
Further, in step 1, fusing the attention mechanism in the neck layer of YOLOv5 specifically comprises: selecting three Concat structures in the neck layer and adding one attention module before or after each selected Concat structure.
Further, in step 1, fusing the attention mechanism at the output end of YOLOv5 specifically comprises: adding one attention module after each Conv layer at the output end.
Further, the attention mechanism or module employs any one of CA, SE, ECA and CBAM attention.
Further, in step 2, the CIoU_Loss function is adopted as the loss function.
The invention has the beneficial effects that:
fusing an attention mechanism into YOLOv5, especially fusing the CA attention mechanism into the backbone layer, can effectively improve the detection precision of the whole method;
on the basis of the fused attention mechanism, the CIoU_Loss function is adopted; combining the two improvements can further effectively reduce the number of false and missed detections in the target detection task, improve the positioning accuracy of the target bounding box, and increase the detection speed.
Drawings
Fig. 1 is a schematic flow chart of a method for detecting a target of a YOLOv5 remote sensing image by using attention mechanism fusion according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of fusing the CA attention mechanism at different positions of YOLOv5s according to an embodiment of the present invention: (a) at the backbone layer of YOLOv5s; (b) at the neck layer of YOLOv5s; (c) at the output end of YOLOv5s;
FIG. 3 is a diagram showing the positional relationship between a predicted frame and a real frame according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a real frame including a prediction frame according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a CIoU loss function according to an embodiment of the present invention;
fig. 6 is a schematic diagram of examples of different targets in an RSOD dataset according to an embodiment of the present invention: (a) an aircraft image; (b) oil drum images; (c) an overpass image; (d) a playground image;
FIG. 7 is a visual comparison of the detection results of the method of the present invention and existing methods according to an embodiment of the present invention: (a) SSD; (b) YOLOv3; (c) YOLOv5s; (d) the YOLOv5s_CA_CIoU of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, an embodiment of the present invention provides a method for detecting a target of YOLOv5 remote sensing image by using attention mechanism fusion, which includes the following steps:
s101: constructing a remote sensing image target detection network, wherein the remote sensing image target detection network comprises a fusion attention mechanism at any one position of a backbone layer, a neck layer and an output end of the YOLOv 5;
specifically, due to the complexity and diversity of the remote sensing image, the YOLOv5 algorithm is directly applied to the target detection task, so that the conditions that dense targets are difficult to detect, the positioning accuracy of multi-scale targets is low, small targets are easy to miss detection and error detection and the like can occur, the effectiveness of the target detection model is greatly reduced, and therefore, further structural optimization and adjustment of the YOLOv5 network are required. The YOLOv5 includes 4 versions in total, and in this embodiment, a basic YOLOv5s version is adopted to construct a remote sensing image target detection network.
The attention mechanism mainly acts on the feature map, so fusing an attention mechanism at a suitable position of the network can effectively improve the network's feature extraction capability. However, since the backbone layer, the neck layer and the output end of YOLOv5s each process features differently, the improvement brought by fusing attention at different positions of the YOLOv5s network also differs. Meanwhile, since the input end of YOLOv5s performs operations such as data preprocessing, which are unrelated to the extraction or fusion of target features, this embodiment does not consider fusing an attention mechanism at the input end.
Further, there are various types of attention mechanisms; any of CA, SE, ECA and CBAM attention can be employed. To maximize the performance of the remote sensing image target detection network, this embodiment preferably fuses CA attention into the YOLOv5s network. For convenience of description, the network with the CA attention mechanism fused into the backbone layer of YOLOv5s is denoted the YOLOv5s_BackBone_CA model; the network with the CA attention mechanism fused into the neck layer is denoted the YOLOv5s_Neck_CA model; and the network with the CA attention mechanism fused into the output end is denoted the YOLOv5s_Prediction_CA model.
Fusing the CA attention mechanism into the backbone layer of YOLOv5s specifically comprises: adding one CA attention module after each CSP structure in the backbone layer, as shown in FIG. 2(a).
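As an illustration of this insertion pattern, the following is a minimal PyTorch sketch of a coordinate attention (CA) block, following the published coordinate attention design; the reduction ratio, the minimum hidden width and the Hardswish activation are assumptions of this sketch, not values fixed by the patent:

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    # Coordinate attention: factorizes attention into 1-D pooling along H and W,
    # so positional information along both axes is preserved.
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (N, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                      # (N, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)  # (N, C, W, 1)
        y = torch.cat([x_h, x_w], dim=2)          # (N, C, H+W, 1)
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        y_w = y_w.permute(0, 1, 3, 2)             # back to (N, mid, 1, W)
        a_h = torch.sigmoid(self.conv_h(y_h))     # height attention, (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w))     # width attention, (N, C, 1, W)
        return x * a_h * a_w
```

The block preserves the input shape (N, C, H, W), so it can be appended directly after a CSP structure without changing the rest of the network.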
Fusing the CA attention mechanism into the neck layer of YOLOv5s specifically comprises: selecting three Concat structures in the neck layer and adding one CA attention module before or after each selected Concat structure, as shown in FIG. 2(b).
Fusing the CA attention mechanism at the output end of YOLOv5s specifically comprises: adding one CA attention module after each Conv layer at the output end, as shown in FIG. 2(c).
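In all three variants the wiring is the same: the output of an existing block (CSP, Concat or Conv) is passed through an attention module before flowing onward. A hypothetical sketch of that wiring (the real YOLOv5s model is assembled from a YAML configuration, so the stand-in modules here are illustrative only; CoordAtt is the sketch given above):

```python
import torch
import torch.nn as nn

class BlockWithAttention(nn.Module):
    """Run an existing block, then an attention module on its output."""
    def __init__(self, block: nn.Module, attn: nn.Module):
        super().__init__()
        self.block = block
        self.attn = attn

    def forward(self, x):
        return self.attn(self.block(x))

# Stand-in for a CSP/C3 block with 64 output channels.
csp = nn.Conv2d(64, 64, kernel_size=3, padding=1)
fused = BlockWithAttention(csp, CoordAtt(64))
out = fused(torch.randn(1, 64, 32, 32))  # shape unchanged: (1, 64, 32, 32)
```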
S102: designing a loss function, and training the remote sensing image target detection network based on the loss function to obtain a remote sensing image target detection network model;
in this embodiment, the original LOSS function giou_loss of YOLOv5s is used as the LOSS function. The formula is shown as formula (1).
$$\mathrm{GIoU}=\mathrm{IoU}-\frac{\left|C\setminus(A\cup B)\right|}{\left|C\right|},\qquad \mathrm{Loss}_{\mathrm{GIoU}}=1-\mathrm{GIoU}\tag{1}$$
In the formula, A and B represent the prediction frame and the real frame, and C represents the minimum bounding box of A and B; the specific positional relationship is shown in FIG. 3.
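As a concrete reading of formula (1), the following is a sketch of the GIoU loss for axis-aligned boxes given as (x1, y1, x2, y2); the eps guard against division by zero is an implementation assumption:

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    # pred, target: (N, 4) boxes as (x1, y1, x2, y2); A = pred, B = target.
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_a = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_b = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_a + area_b - inter
    iou = inter / (union + eps)
    # C: minimum bounding box enclosing both A and B
    cx1, cy1 = torch.min(pred[:, 0], target[:, 0]), torch.min(pred[:, 1], target[:, 1])
    cx2, cy2 = torch.max(pred[:, 2], target[:, 2]), torch.max(pred[:, 3], target[:, 3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / (area_c + eps)  # |C \ (A ∪ B)| / |C|
    return 1.0 - giou  # Loss_GIoU = 1 - GIoU
```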
S103: and inputting the remote sensing image to be detected into the trained target detection network model to obtain a detection result.
In order to verify the detection performance of the remote sensing image target detection network provided by the embodiment of the invention, the following experimental data are also provided.
To explore the difference in improvement effect among the three fusion positions, comparison experiments were performed on the obtained attention fusion models and the original YOLOv5s model using the RSOD dataset; the experimental results are shown in Table 1. The definitions of the evaluation indexes in Table 1 are given in Example 3 and are not repeated here.
TABLE 1 results of CA attention module fusion experiments
[Table 1 is reproduced in the original as an image: P, R and mAP50 of YOLOv5s, YOLOv5s_BackBone_CA, YOLOv5s_Neck_CA and YOLOv5s_Prediction_CA; the relative changes are discussed below.]
As shown in Table 1, because fusing CA attention in the backbone layer makes full use of the contour and positioning information of the target and thus more effectively suppresses complex background information in the feature map, the mAP50 of the YOLOv5s_BackBone_CA model is far higher than that of the other two fusion models and is 2.5% higher than that of the original YOLOv5s model, indicating that fusing CA attention in the backbone layer brings the best improvement and greatly raises detection precision. Compared with the original YOLOv5s model, fusing CA attention at the neck layer improves mAP50 by 1.1%, which shows that CA attention at the neck layer can effectively enhance the feature extraction capability of the network; however, because features are transmitted and fused in the neck layer, part of the information is lost, so the improvement is smaller than at the backbone layer. Since feature extraction and fusion are already complete when features reach the output end, the receptive field shrinks and semantic information is lost; the YOLOv5s_Prediction_CA model obtained by fusing CA attention at the output end therefore drops 2.1% in mAP50 compared with the original YOLOv5s model, although its precision improves. From this it can be seen that in this embodiment the backbone layer of YOLOv5s is the best position for fusing CA attention.
Building on fusing the attention module at the YOLOv5s backbone layer position, and to further verify the effectiveness of CA attention fusion there, the CA attention in the model was replaced with SE, ECA and CBAM attention respectively, yielding three new models: YOLOv5s_BackBone_SE, YOLOv5s_BackBone_ECA and YOLOv5s_BackBone_CBAM. Comparative experiments were performed on the RSOD dataset; the experimental results are shown in Table 2.
Table 2 results of comparative experiments fusing different attention modules
[Table 2 is reproduced in the original as an image: P, R and mAP50 of the backbone-layer fusion models with SE, ECA, CBAM and CA attention.]
As shown in Table 2, the improved models obtained by fusing SE, ECA and CBAM attention respectively at the backbone layer of YOLOv5s all improve mAP50, while the model fusing CA attention achieves the highest P, R and mAP50, which proves the effectiveness of fusing CA attention at the backbone layer of the network.
Example 2
The GIoU_Loss introduces a minimum circumscribed rectangle on the basis of IoU_Loss, but since GIoU_Loss only considers the degree of coincidence between the real frame and the prediction frame, it cannot describe the regression relation of the target frame well. Moreover, when the target prediction frame lies entirely within the real frame, i.e. B ∩ A = A, GIoU_Loss cannot distinguish the positions of different prediction frames, as shown in FIG. 4.
Therefore, on the basis of Embodiment 1 above, this embodiment differs in that it modifies the loss function: the more complete CIoU_Loss is selected as the loss function of YOLOv5s.
CIoU_Loss solves the above problems of GIoU_Loss by taking the scale information of the bounding box into account and adding losses for the scale and aspect ratio of the detection frame, which makes the prediction frame conform better to the real frame and achieves an effective fit between the prediction frame and the real frame. CIoU_Loss is illustrated in FIG. 5.
The CIoU_Loss is calculated as follows.
$$\mathrm{Loss}_{\mathrm{CIoU}}=1-\mathrm{IoU}+\frac{\rho^{2}\left(b,b^{gt}\right)}{c^{2}}+\alpha v$$

$$v=\frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}}-\arctan\frac{w}{h}\right)^{2}$$

$$\alpha=\frac{v}{(1-\mathrm{IoU})+v}$$

where $b$ and $b^{gt}$ denote the center points of the prediction frame and the real frame, $\rho(\cdot)$ is the Euclidean distance between them, $c$ is the diagonal length of the minimum bounding box covering both frames, and $w^{gt}/h^{gt}$ and $w/h$ represent the aspect ratios of the target frame and the prediction frame, respectively.
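A sketch of this standard CIoU formulation in PyTorch (the eps guards and the no-grad treatment of α are implementation assumptions); the example at the end reproduces the FIG. 4 situation, where two same-size prediction frames inside the real frame receive identical GIoU_Loss but different CIoU_Loss:

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # pred, target: (N, 4) boxes given as (x1, y1, x2, y2).
    pw, ph = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    tw, th = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    # IoU of prediction and real frames
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)
    # rho^2: squared distance between the two box centers
    rho2 = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])) ** 2 / 4 \
         + ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])) ** 2 / 4
    # c^2: squared diagonal of the minimum bounding box of both frames
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # v: aspect-ratio consistency term; alpha: its trade-off weight
    v = (4 / math.pi ** 2) * (torch.atan(tw / (th + eps)) - torch.atan(pw / (ph + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

# FIG. 4 situation: both predictions lie inside the real frame and have equal
# area, so GIoU_Loss cannot separate them, while CIoU_Loss penalizes the
# off-center one through the rho^2 / c^2 term:
gt = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
centered = torch.tensor([[4.0, 4.0, 6.0, 6.0]])
cornered = torch.tensor([[0.0, 0.0, 2.0, 2.0]])
print(ciou_loss(centered, gt), ciou_loss(cornered, gt))  # ~0.96 vs ~1.12
```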
Through these multidimensional considerations, CIoU_Loss improves the bounding-box regression performance and the positioning precision of the model, so the regression effect of the prediction frame is better, convergence is faster, and robustness for multi-scale target detection is enhanced.
Example 3
This embodiment uses the RSOD dataset to train and test the method of the invention. First, the effectiveness of the two improvement points (the fused attention mechanism and the modified loss function) was evaluated by an ablation experiment. The method of the invention was then compared with the SSD, YOLOv3 and original YOLOv5s algorithms, and part of the test results were selected for visualization to verify the effectiveness of the method.
(I) Experimental data and environment
The experiment uses the RSOD dataset, which contains images with features at different scales, 2326 images in total covering four classes of targets. FIG. 6(a) shows an aircraft target; FIG. 6(b) shows oil drum targets, which are closely arranged and of different sizes; FIG. 6(c) shows an overpass target, whose background information is more complex; FIG. 6(d) shows a playground target image.
The experiment uses a Windows 10 64-bit operating system; the GPU is a GeForce RTX 3080 Ti; Python 3.8 is selected; the programming platform is PyCharm; and the deep learning framework is PyTorch 1.8.0 with CUDA 11.1. The number of iterations (Epoch) is set to 150 and the Batch Size to 16. The specific experimental environment configuration is shown in Table 3.
TABLE 3 Experimental Environment
Parameter                 Configuration
Operating system          Windows 10 (64-bit)
GPU                       GeForce RTX 3080 Ti
Language                  Python 3.8
Programming platform      PyCharm
Deep learning framework   PyTorch 1.8.0 / CUDA 11.1
Epoch                     150
Batch Size                16
(II) evaluation index
The improved algorithm is evaluated from the two angles of detection accuracy and timeliness. The accuracy indexes are the mean average precision (mAP) and the average precision (AP); the timeliness index is the number of image frames processed per second (FPS). The calculation formulas of the respective indexes are shown below.
$$P=\frac{TP}{TP+FP}$$

$$R=\frac{TP}{TP+FN}$$

$$AP=\int_{0}^{1}P(R)\,dR$$

$$mAP=\frac{1}{N}\sum_{i=1}^{N}AP_{i}$$

$$FPS=\frac{\mathrm{figureNumber}}{\mathrm{totalTime}}$$
Wherein P represents precision and R represents recall; AP is the area enclosed by the P-R curve; the mAP value is the average of the APs of all classes; TP represents the number of correctly detected frames; FP represents the number of falsely detected frames; FN represents the number of undetected ground-truth (GT) boxes; figureNumber represents the total number of detected pictures; and totalTime represents the total detection time.
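A sketch of how AP (and hence mAP50) can be computed from ranked detections under these definitions; the all-point interpolation of the P-R curve is an assumption of this sketch, since the text does not fix an integration scheme:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    # scores: detection confidences; is_tp: 1 if a detection matches a GT box
    # at IoU >= 0.5 (the mAP50 setting), else 0; num_gt: number of GT boxes.
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    precision = tp_cum / (tp_cum + fp_cum)          # P = TP / (TP + FP)
    recall = tp_cum / max(num_gt, 1)                # R = TP / (TP + FN)
    # Area under the P-R curve (all-point interpolation)
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]  # make precision non-increasing
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

# mAP is the mean of the per-class APs, e.g. for the four RSOD classes:
# mAP50 = np.mean([ap_aircraft, ap_oiltank, ap_overpass, ap_playground])
```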
IoU = 0.5 is a common standard for testing algorithm performance and reflects the comprehensive classification capability of an algorithm over various targets, so mAP50 is used as the mAP evaluation index.
(III) ablation experiments
To verify the effectiveness of the two improvement modules, CA attention and CIoU_Loss, in the model, an ablation experiment was performed on the improved algorithm based on YOLOv5s; the experimental results are shown in Table 4.
Table 4 ablation experimental results
[Table 4 is reproduced in the original as an image: P, R and mAP50 of YOLOv5s, YOLOv5s_CA, YOLOv5s_CIoU and YOLOv5s_CA_CIoU; the key values are quoted below.]
As can be seen from the experimental results in Table 4, the YOLOv5s_CA model obtained by fusing the CA attention module alone and the YOLOv5s_CIoU model obtained by introducing the CIoU_Loss function alone both improve the mAP50 of the original YOLOv5s algorithm, demonstrating the effectiveness of each improvement module. Taken together, even though the precision of the improved algorithm YOLOv5s_CA_CIoU is slightly lower than that of the YOLOv5s_CIoU model obtained by introducing the CIoU_Loss function alone, the recall and mAP50 of YOLOv5s_CA_CIoU reach 87.4% and 91.1% respectively, improvements of 1.8% and 2.9% over the original YOLOv5s algorithm. This indicates that the combination of fusing CA attention and replacing the loss function with CIoU_Loss yields the best improvement, further verifying the effectiveness of the method.
(IV) comparative experiments
In order to more comprehensively verify the effectiveness of the method and further evaluate its improvements in detection precision, speed and other respects, the method is compared with the YOLOv3, SSD and YOLOv5s algorithms. Experiments were performed under the same training conditions using the same dataset; the results are shown in Table 5.
Table 5 comparison of results of mainstream algorithm detection
[Table 5 is reproduced in the original as an image: mAP50, parameter quantity and FPS of SSD, YOLOv3, YOLOv5s and YOLOv5s_CA_CIoU; the key values are quoted below.]
As can be seen from Table 5, the detection accuracy of the YOLOv5s_CA_CIoU of the invention is the highest: its mAP50 reaches 91.1%, improvements of 8.9%, 5.3% and 2.9% over the SSD, YOLOv3 and YOLOv5s algorithms respectively. Even though the parameter quantity of the method is slightly larger than that of the YOLOv5s algorithm, its detection speed (FPS) is improved, showing that the method obtains higher detection precision at the cost of a small increase in parameters. The SSD algorithm has relatively weak feature extraction capability due to its relatively simple network structure, so its detection accuracy is limited when facing detection tasks with complex target backgrounds. The YOLOv3 algorithm enhances feature fusion, so its detection capability improves over the SSD model, but its detection precision is still far lower than that of the method of the invention.
To further evaluate the detection effect of the method, the detection results of some remote sensing images in the RSOD dataset are selected for visual comparison, and the detection effects on the four targets (aircraft, oil drum, overpass and playground) are evaluated in terms of missed detections, false detections, the positioning accuracy of the target bounding box, and the like. As shown in FIG. 7, the visual detection results of the four algorithms SSD, YOLOv3, YOLOv5s and YOLOv5s_CA_CIoU are arranged from left to right, where a box with the largest gray value corresponds to a correct detection, a box with the middle gray value to a false detection, and a box with the smallest gray value to a missed detection.
As can be seen from the visualization results, for aircraft detection the aircraft targets in the image are compactly arranged and of different sizes: the SSD algorithm in FIG. 7(a) exhibits missed detections; the YOLOv3 algorithm in FIG. 7(b) exhibits both missed and false detections, misjudging the blank area between two aircraft as an aircraft target; the YOLOv5s algorithm in FIG. 7(c) also exhibits false and missed detections and fails to detect the small aircraft target on the right side of the image; while the improved algorithm YOLOv5s_CA_CIoU in FIG. 7(d) has neither false nor missed detections, proving that its detection accuracy in small-target scenes is greatly improved over the original YOLOv5s algorithm. For oil drum detection, the SSD algorithm in FIG. 7(a) and the YOLOv3 algorithm in FIG. 7(b) exhibit missed detections, failing to detect some oil drum targets; the YOLOv5s algorithm in FIG. 7(c) exhibits false detections, misjudging blank areas between objects as oil drum targets; while the improved algorithm YOLOv5s_CA_CIoU in FIG. 7(d) accurately detects all targets, proving that the method effectively improves the target detection capability of the original YOLOv5s algorithm in dense scenes. For overpass and playground detection, compared with the other three algorithms, the improved algorithm YOLOv5s_CA_CIoU has no missed or false detections and positions the target bounding boxes more accurately.
In conclusion, compared with the SSD, YOLOv3 and YOLOv5s algorithms, the YOLOv5s_CA_CIoU of the method of the invention achieves higher detection precision and more accurate bounding-box positioning of detected targets.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (5)

1. A YOLOv5 remote sensing image target detection method using attention mechanism fusion, characterized by comprising the following steps:
step 1: constructing a remote sensing image target detection network, wherein an attention mechanism is fused at any one of the backbone layer, the neck layer and the output end of YOLOv5; fusing the attention mechanism in the backbone layer of YOLOv5 specifically comprises: adding an attention module after each CSP structure in the backbone layer;
step 2: designing a loss function, and training the remote sensing image target detection network based on the loss function to obtain a remote sensing image target detection network model;
step 3: and inputting the remote sensing image to be detected into the trained target detection network model to obtain a detection result.
2. The YOLOv5 remote sensing image target detection method using attention mechanism fusion according to claim 1, wherein in step 1, fusing the attention mechanism in the neck layer of YOLOv5 specifically comprises: selecting three Concat structures in the neck layer and adding one attention module before or after each selected Concat structure.
3. The YOLOv5 remote sensing image target detection method using attention mechanism fusion according to claim 1, wherein in step 1, fusing the attention mechanism at the output end of YOLOv5 specifically comprises: adding one attention module after each Conv layer at the output end.
4. The YOLOv5 remote sensing image target detection method using attention mechanism fusion according to any one of claims 1 to 3, wherein the attention mechanism or attention module employs any one of CA, SE, ECA and CBAM attention.
5. The YOLOv5 remote sensing image target detection method using attention mechanism fusion according to claim 1, wherein in step 2, the CIoU_Loss function is adopted as the loss function.
CN202310177081.1A 2023-02-28 2023-02-28 YOLOv5 remote sensing image target detection method utilizing attention mechanism fusion Pending CN116343027A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310177081.1A CN116343027A (en) 2023-02-28 2023-02-28 YOLOv5 remote sensing image target detection method utilizing attention mechanism fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310177081.1A CN116343027A (en) 2023-02-28 2023-02-28 YOLOv5 remote sensing image target detection method utilizing attention mechanism fusion

Publications (1)

Publication Number Publication Date
CN116343027A (en) 2023-06-27

Family

ID=86876727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310177081.1A Pending CN116343027A (en) 2023-02-28 2023-02-28 YOLOv5 remote sensing image target detection method utilizing attention mechanism fusion

Country Status (1)

Country Link
CN (1) CN116343027A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116645502A (en) * 2023-07-27 2023-08-25 云南大学 Power transmission line image detection method and device and electronic equipment
CN116645502B (en) * 2023-07-27 2023-10-13 云南大学 Power transmission line image detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination