CN116091946A - Yolov 5-based unmanned aerial vehicle aerial image target detection method - Google Patents

Yolov 5-based unmanned aerial vehicle aerial image target detection method Download PDF

Info

Publication number
CN116091946A
CN116091946A CN202211559260.3A CN202211559260A CN116091946A
Authority
CN
China
Prior art keywords
network
aerial vehicle
unmanned aerial
convolution
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211559260.3A
Other languages
Chinese (zh)
Inventor
周丽芳
王智峰
李伟生
肖明琪
王婧琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202211559260.3A priority Critical patent/CN116091946A/en
Publication of CN116091946A publication Critical patent/CN116091946A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/17Terrestrial scenes taken from planes or by drones
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Remote Sensing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a YOLOv5-based unmanned aerial vehicle aerial image target detection method, belonging to the technical field of target detection. The method comprises the following steps: step 1, the YOLOv5 algorithm is used as the basic model framework, and, in order to improve the accuracy of small-target detection in urban aerial images, a network that extracts multiple kinds of contextual features is designed; step 2, in order to improve the network's attention to dense areas, a serial cross self-attention module is added between the backbone network and each of the three detection heads to further enhance the information of dense regions; step 3, the final network model is obtained through iterative training and parameter updating, multi-scale prediction is used to improve small-target detection performance, and the final result is obtained through the three-scale detection-head predictions. The invention effectively alleviates the loss of context information, enhances the feature extraction capability, captures a more diverse feature space and achieves more accurate anchor-box positioning.

Description

YOLOv5-based unmanned aerial vehicle aerial image target detection method
Technical Field
The invention relates to the fields of computer vision and deep learning, and in particular to a YOLOv5-based unmanned aerial vehicle aerial image target detection method.
Background
With the rapid development of unmanned aerial vehicle (UAV) technology and deep learning, high-resolution, large-scale UAV image data have become increasingly abundant, and urban UAV images generally suffer from small targets, high resolution and uneven target distribution. Artificial neural networks are widely applied in UAV image target detection; most algorithms are based on prior (anchor) boxes and perform well on conventional datasets but only moderately on UAV images. UAV image target detection that balances detection speed and detection accuracy has therefore become a research hotspot in this field.
Object detection aims to find all objects of interest in an image and comprises two subtasks, object localization and object classification, i.e. determining the category and the position of each object simultaneously. Currently, widely used target detection methods fall mainly into two categories: One-stage and Two-stage. Two-stage methods are region-based algorithms that divide target detection into a detection stage and an identification stage: a region of interest is first found in the image by an algorithm or a network, and the targets within that region are then identified, e.g. RCNN and Fast-RCNN. One-stage methods are end-to-end algorithms that use a regression idea to directly generate the category probability and the position coordinates of the target, achieving detection and identification in a single pass, e.g. YOLO and SSD. One-stage methods have a speed advantage over Two-stage methods, but their accuracy is comparatively lower.
Because targets in UAV images suffer from a single imaging view angle, dense target distribution and large target-scale variation, directly applying natural-scene target detection methods to the UAV image target detection task does not give satisfactory results. At the same time, high resolution and large image sizes further increase the computational cost of an algorithm. In recent years, One-stage algorithms have become comparable to Two-stage algorithms in precision; the YOLO series is representative of One-stage algorithms, and YOLOv5 is a target detection network that balances speed and accuracy. Compared with the RCNN series of object detection methods, however, its localization accuracy is poorer and its recall rate lower. How to design an algorithm suitable for fast target detection in UAV images while improving the detection accuracy of small targets and objects in dense areas therefore remains a difficulty.
CN113807464B, an unmanned aerial vehicle aerial image target detection method based on improved YOLO V5, belongs to the fields of deep learning and target detection. In that method, a dataset is first constructed from UAV aerial images; the slicing layer in the Focus module of the YOLO V5 backbone network is replaced by a convolution layer; the image features are then further processed by the Neck part; next, to address the scattered target distribution and the very small target pixel ratio caused by the high-altitude aerial viewing angle, the large 76×255 detection head is removed from the network prediction layer and the anchor boxes are adjusted at the same time; finally, target detection performance is evaluated with generalized intersection over union, average precision and inference speed. The method achieves fast and accurate detection of UAV aerial image targets while improving recognition accuracy and feature extraction performance.
The CN113807464B patent does not take the context information in the image into account when improving the backbone network, and in the detection-head part it only removes the large detection head without optimizing the existing detection heads. In contrast, the present invention uses dilated convolution and deformable convolution to enlarge the receptive field when optimizing the backbone network, obtaining more comprehensive context information that helps to detect small targets. Meanwhile, in the detection-head part, the detection performance is enhanced by combining shallow backbone information with self-attention, so that the detection heads focus on regions that contain objects.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art and provides a YOLOv5-based unmanned aerial vehicle aerial image target detection method. The technical scheme of the invention is as follows:
an unmanned aerial vehicle aerial image target detection method based on YOLOv5 comprises the following steps:
step 1: dividing an unmanned aerial vehicle image dataset into a training set and a test set, preprocessing the training set and applying data enhancement to obtain a complete sample dataset, and clustering the complete sample dataset with the K-means algorithm to obtain the anchor box sizes;
step 2: based on the backbone network of YOLOv5, constructing a context extraction module to extract features of the unmanned aerial vehicle image and enlarge the receptive field by means of dilated convolution and deformable convolution;
step 3: between the backbone network and the Neck layer, in order to utilize the shallow semantic information and make the network focus on dense areas, constructing a serial cross self-attention module for feature enhancement;
step 4: and obtaining a final model through complete training, and detecting the test picture by using the model, thereby obtaining a detection result.
Further, in step 1 the training set is preprocessed and data-enhanced to obtain a complete sample dataset, which is clustered with the K-means algorithm to obtain the anchor boxes; this specifically comprises the following steps:
step 1.1, scaling and stretching the pictures in the initial sample dataset to 1088×1088 pixels while keeping the anchor-box proportions;
step 1.2, applying data enhancement to the picture data obtained in step 1.1: sample data are added through translation, rotation, saturation adjustment and exposure adjustment to enrich the characteristic parameters of the objects to be identified;
step 1.3, carrying out cluster analysis, with the K-means clustering algorithm, on the ground-truth bounding boxes of the targets to be identified annotated in the sample data training set obtained in step 1.2; 9 anchor boxes are initialized by randomly selecting 9 values from all bounding boxes as their initial values; the IoU (intersection over union) between each bounding box and each anchor box is calculated as:

IoU = (bounding box ∩ anchor box) / (bounding box ∪ anchor box)

where ∩ denotes the intersection and ∪ denotes the union.
The highest IoU value is then selected for each bounding box, and the average over all bounding boxes is taken, namely the final accuracy value; finally, 9 refined anchor boxes are obtained as preset values of the network, as in the sketch below.
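A minimal sketch of how steps 1.1 to 1.3 could be implemented is given below; the torchvision transforms, the augmentation strengths and the helper names are illustrative assumptions rather than the patent's implementation (the bounding-box annotations would also have to be transformed consistently with the images, which this sketch omits):

```python
import numpy as np
from torchvision import transforms

# Steps 1.1-1.2 (illustrative): resize every picture to 1088 x 1088 and augment by
# translation, rotation, saturation and exposure changes.
train_transforms = transforms.Compose([
    transforms.Resize((1088, 1088)),
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),
    transforms.ColorJitter(saturation=0.5, brightness=0.4),
    transforms.ToTensor(),
])

# Step 1.3: K-means clustering of ground-truth (w, h) pairs with a 1 - IoU distance.
def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs when both boxes are centred at the origin."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]      # random initialization
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes_wh, anchors), axis=1)            # nearest anchor by IoU
        new = np.array([boxes_wh[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    # "accuracy": mean of the best IoU of every ground-truth box with the final anchors
    accuracy = iou_wh(boxes_wh, anchors).max(axis=1).mean()
    return anchors[np.argsort(anchors.prod(axis=1))], accuracy
```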
Further, the step 2 of constructing a plurality of context extraction modules to extract features of the unmanned aerial vehicle image specifically includes:
2.1, after the SPP (spatial pyramid pooling) module of the backbone network, features are extracted from the original feature map with 3 groups of dilated convolutions, whose dilation rates are set to 1, 2 and 3 respectively; the obtained feature maps are merged with a Concat operation;
2.2, the boundary information of the feature map obtained in step 2.1 is then corrected with a deformable convolution, specifically: an additional convolution layer is built to learn offset information, and the offsets are used to reposition the convolution sampling positions; finally, to keep the channel numbers the same, a 1x1 convolution reduces the dimension and a skip connection performs feature fusion, as sketched below.
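For illustration only, a possible PyTorch realization of the context extraction module of steps 2.1 and 2.2 is sketched below; the module name, the exact channel handling and the use of torchvision's DeformConv2d with an extra offset-learning convolution are assumptions, not the patent's implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ContextExtraction(nn.Module):
    """Multi-context extraction sketch (steps 2.1-2.2): three dilated 3x3 convolutions
    with dilation rates 1, 2 and 3, Concat, a deformable convolution whose offsets are
    learned by an extra convolution layer, a 1x1 reduction and a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in (1, 2, 3)]
        )
        self.offset = nn.Conv2d(3 * channels, 18, 3, padding=1)   # 2 * 3 * 3 offsets per position
        self.deform = DeformConv2d(3 * channels, 3 * channels, 3, padding=1)
        self.reduce = nn.Conv2d(3 * channels, channels, 1)        # 1x1 reduction to the input channels

    def forward(self, x):
        ctx = torch.cat([branch(x) for branch in self.branches], dim=1)  # Concat of dilated features
        ctx = self.deform(ctx, self.offset(ctx))                         # boundary correction
        return x + self.reduce(ctx)                                      # skip connection

# usage sketch: y = ContextExtraction(512)(torch.randn(1, 512, 20, 20))
```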
Further, the step 3 of constructing a serial cross self-attention to perform feature extraction on the unmanned aerial vehicle image specifically includes:
3.1 On the Ultralytics YOLOv5 network model, serial cross self-attention is inserted between the backbone and the detection heads; first, 3 cross convolutions are used to extract features from the feature map; cross convolution emphasizes edge information by mining vertical and horizontal gradient information in parallel, providing information enhancement for the subsequent self-attention; the cross convolution is built from two asymmetric filters, denoted 1×3 and 3×1; with F_in and F_out the input and output feature maps:

F_out = k_{1×3} * F_in + k_{3×1} * F_in

where k_{1×3} and k_{3×1} denote the two convolution kernels and * denotes convolution.
3.2 For each feature map I ∈ R^{C×H×W} (C is the number of channels, H the height and W the width of the feature map), three feature maps Q, K and V are generated independently by cross convolution, where Q, K ∈ R^{C'×H×W}; C and C' both denote channel numbers, here C' = C/8. Q is decomposed along the two directions into Q_H of shape (B×W, H, C') and Q_W of shape (B×H, W, C'), i.e. the vertical and horizontal decompositions of the Q feature map; K is treated in the same way. The horizontal and vertical directions are then weighted respectively to obtain the attention A:

A = softmax(Q_H ⊗ K_H^T , Q_W ⊗ K_W^T)

where ⊗ denotes batched matrix multiplication along the decomposed dimension. After obtaining A (Attention) and V (Value), A is split into A_H of shape (B×W, H, H) and A_W of shape (B×H, W, W), the vertical and horizontal attention maps, and V is likewise split into V_H of shape (B×W, C, H) and V_W of shape (B×H, C, W). Out denotes the final output feature map:

Out = V_H ⊗ A_H + V_W ⊗ A_W

Finally, the cross self-attention is connected in series so that each point on the feature map can be associated with every other point in the computation (a code sketch of steps 3.1 and 3.2 is given after step 3.4);
3.3 The enhanced features are fed into YOLO detection heads at three scales, corresponding to small, medium and large targets respectively; the anchor boxes clustered in step 1.3 are used as prior boxes, and the number of predicted object categories is set;
3.4 At this point, the whole network framework has been built.
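The following PyTorch sketch illustrates the cross convolution of step 3.1 and a single cross self-attention pass of step 3.2; the per-direction softmax, the learnable residual weight gamma and all class and parameter names are assumptions made for illustration, and the serial connection is shown by simply applying the module twice:

```python
import torch
import torch.nn as nn

class CrossConv(nn.Module):
    """Cross convolution (step 3.1): two asymmetric filters, 1x3 and 3x1, applied in
    parallel and summed, mining horizontal and vertical gradient information."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.k1x3 = nn.Conv2d(in_ch, out_ch, (1, 3), padding=(0, 1))
        self.k3x1 = nn.Conv2d(in_ch, out_ch, (3, 1), padding=(1, 0))

    def forward(self, x):
        return self.k1x3(x) + self.k3x1(x)   # F_out = k_1x3 * F_in + k_3x1 * F_in

class CrossSelfAttention(nn.Module):
    """One cross self-attention pass (step 3.2): Q and K use C' = C // 8 channels,
    attention is computed separately along the vertical (H) and horizontal (W)
    directions, and the result is added back to the input feature map."""
    def __init__(self, channels):
        super().__init__()
        self.q = CrossConv(channels, channels // 8)
        self.k = CrossConv(channels, channels // 8)
        self.v = CrossConv(channels, channels)
        self.gamma = nn.Parameter(torch.zeros(1))      # learnable residual weight (assumption)

    def forward(self, x):
        B, C, H, W = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # vertical decomposition: every column is a sequence of length H -> Q_H, K_H, V_H
        q_h = q.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        k_h = k.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        v_h = v.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        # horizontal decomposition: every row is a sequence of length W -> Q_W, K_W, V_W
        q_w = q.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        k_w = k.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        v_w = v.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        a_h = torch.softmax(q_h @ k_h.transpose(1, 2), dim=-1)        # A_H: (B*W, H, H)
        a_w = torch.softmax(q_w @ k_w.transpose(1, 2), dim=-1)        # A_W: (B*H, W, W)
        out_h = (a_h @ v_h).reshape(B, W, H, C).permute(0, 3, 2, 1)   # back to (B, C, H, W)
        out_w = (a_w @ v_w).reshape(B, H, W, C).permute(0, 3, 1, 2)
        return x + self.gamma * (out_h + out_w)

# "serial" connection: two passes so every position can relate to every other position
# attn = CrossSelfAttention(256); y = attn(attn(torch.randn(1, 256, 40, 40)))
```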
Further, the step 4 obtains a final model through complete training, and uses the model to perform target detection on the test picture to obtain a final detection result, which specifically includes:
training the training set by using the network constructed in the step 3 to obtain a network output model;
4.2, the output of the network is downsampled to obtain three multi-scale feature maps; each cell in a feature map predicts 3 bounding boxes, and each bounding box predicts: (1) the position of the box (4 values: the center coordinates t_x and t_y, and the box height t_h and width t_w); (2) one objectness prediction (confidence); (3) N class scores;
4.3, bounding-box coordinate prediction: t_x, t_y, t_w and t_h are the raw predicted outputs of the model; c_x and c_y denote the coordinates of the grid cell (for example, the cell in row 0 and column 1 has c_x = 1 and c_y = 0); p_w and p_h denote the width and height of the prior (anchor) box; b_x, b_y, b_w and b_h are the center coordinates and size of the predicted bounding box; the coordinate loss adopts a squared-error loss;

b_x = δ(t_x) + c_x
b_y = δ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
Pr(object) · IOU(b, object) = δ(t_0)

where δ(·) denotes the sigmoid function.
4.4, class prediction adopts multi-label classification: the category labels of a detected result may contain two classes at the same time, so a logistic regression layer is used to perform a binary classification for each class; the logistic regression layer mainly uses the sigmoid function, which constrains its input to the range 0 to 1, so that when the output of a feature-extracted image for a certain class is constrained by the sigmoid function, the object is judged to belong to that class if the result is greater than 0.5.
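A decoding sketch for the step 4.3 equations and the step 4.4 sigmoid thresholding follows; the tensor layout, the scaling of box centers to pixels by the feature-map stride and the helper name decode_predictions are assumptions made for illustration:

```python
import torch

def decode_predictions(t, anchors, stride):
    """Decode one detection-head output t of shape (B, A, H, W, 5 + N):
    raw values are t_x, t_y, t_w, t_h, t_o followed by N class logits."""
    B, A, H, W, _ = t.shape
    cy, cx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((cx, cy), dim=-1).view(1, 1, H, W, 2).float()   # c_x, c_y per cell
    p_wh = anchors.view(1, A, 1, 1, 2)                                 # prior sizes p_w, p_h (pixels)

    xy = (torch.sigmoid(t[..., 0:2]) + grid) * stride    # b_x = sigmoid(t_x) + c_x, scaled to pixels
    wh = p_wh * torch.exp(t[..., 2:4])                   # b_w = p_w * e^t_w, b_h = p_h * e^t_h
    conf = torch.sigmoid(t[..., 4:5])                    # objectness = sigmoid(t_0)
    cls = torch.sigmoid(t[..., 5:])                      # independent sigmoid per class (multi-label)
    keep = cls > 0.5                                     # step 4.4: class assigned when score > 0.5
    return torch.cat((xy, wh, conf, cls), dim=-1), keep

# usage sketch (values are placeholders): one medium-scale head with 3 anchors and 10 classes
# boxes, labels = decode_predictions(torch.randn(1, 3, 68, 68, 15),
#                                    torch.tensor([[30., 61.], [62., 45.], [59., 119.]]), stride=16)
```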
The invention has the advantages and beneficial effects as follows:
the method mainly aims at the problem that in the current popular unmanned aerial vehicle image target detection task based on the deep convolutional neural network, the detection precision of images with uneven distribution on small targets and objects is not high; a YOLOv 5-based unmanned aerial vehicle target detection method added with various context information extraction modules and serial cross self-attention is provided. In the network structure design stage, selecting a YOLOv5 algorithm as a reference algorithm, and replacing the traditional convolution by utilizing the cavity convolution and the deformable convolution to extract various context characteristics and enlarge a receptive field; considering that the features extracted at the stage of the main network belong to shallow features, the features have rich semantic information, and the features at the neck are deep features, so that serial cross self-attention is added between the main network and the neck, and the neck features are effectively enhanced; the serial cross self-attention is further enhanced by calculating weight information in the transverse direction and the longitudinal direction, obtaining global features through serial connection and utilizing cross convolution; by taking the extracted features as the input of the YOLO detection head, three-scale prediction is carried out, so that the robustness of detection of a small target is enhanced. The method has good detection effect.
Drawings
FIG. 1 is a network framework of a method for detecting an image object of an unmanned aerial vehicle based on YOLOv5 according to a preferred embodiment of the present invention;
fig. 2 is a schematic diagram of a multiple context information extraction module according to the present invention.
FIG. 3 is a schematic illustration of serial cross self-attention of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and specifically described below with reference to the drawings in the embodiments of the present invention. The described embodiments are only a few embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the embodiment of the invention is based on a YOLOv5 target detection framework as a basic framework, and the detail is shown in https:// gitsub. The context extraction module formed by the cavity convolution and the deformable convolution is added behind the SPP module of the backbone network, so that the receptive field is enlarged. The attention mechanism combined by the cross-rolling and the self-attention is added between the backbone network and the neck, so that the attention of the network to the dense area is improved.
The invention is further described below with reference to the accompanying drawings:
As shown in fig. 1, the design flow of the network framework of the YOLOv5-based unmanned aerial vehicle image target detection method comprises the following steps:
A. The design is carried out on the Ultralytics YOLOv5 network model; the backbone of YOLOv5 comprises a Focus module, an SPP (spatial pyramid pooling) module and several CBS and C3 modules.
B. The multi-context information extraction module shown in fig. 2 is added after the SPP module at the end of the backbone network. First, feature extraction is performed on the original feature map with 3 groups of dilated convolutions, whose dilation rates are set to 1, 2 and 3 respectively. The resulting feature maps are merged with a Concat operation.
C. A deformable convolution is then used to correct the boundary information of the feature map obtained in the previous step: an additional convolution layer is built to learn offset information, and the offsets are used to reposition the convolution sampling positions. Finally, to keep the channel numbers the same, a 1x1 convolution reduces the dimension and a skip connection performs feature fusion.
Further, in order to fuse the shallow information of the backbone network with the deep information of the neck and make the network focus more on dense areas, serial cross self-attention is inserted between the backbone and the detection heads. The specific network flow is shown in fig. 3, and the implementation steps are as follows:
A. Feature extraction is performed on the feature map using 3 cross convolutions. Cross convolution emphasizes edge information by mining vertical and horizontal gradient information in parallel, providing information enhancement for the subsequent self-attention. The cross convolution is built from two asymmetric filters, denoted 1×3 and 3×1. With F_in and F_out the input and output feature maps:

F_out = k_{1×3} * F_in + k_{3×1} * F_in

where k_{1×3} and k_{3×1} denote the two convolution kernels and * denotes convolution.
B. For each feature map I ∈ R^{C×H×W}, three feature maps Q, K and V are first generated independently by cross convolution, where Q, K ∈ R^{C'×H×W}; C and C' both denote channel numbers, here C' = C/8 is set. Q is decomposed along the two directions into Q_H of shape (B×W, H, C') and Q_W of shape (B×H, W, C'); K is treated in the same way. The horizontal and vertical directions are then weighted respectively to obtain A (Attention):

A = softmax(Q_H ⊗ K_H^T , Q_W ⊗ K_W^T)

where ⊗ denotes batched matrix multiplication along the decomposed dimension. After obtaining A (Attention) and V (Value), A is split into A_H of shape (B×W, H, H) and A_W of shape (B×H, W, W), and V is likewise split into V_H of shape (B×W, C, H) and V_W of shape (B×H, C, W). The output feature map is:

Out = V_H ⊗ A_H + V_W ⊗ A_W
C. Finally we connect the cross self-attention in series to ensure that each point on the feature map can be associated with other points.
D. And inputting the reinforced features into three-scale YOLO detection heads, respectively corresponding to the small, medium and large target objects, using the anchor boxes clustered in 1.3 as prior frames, and setting the number of the predicted object categories.
Further, a final model is obtained through complete training, and the picture to be tested is detected by using the model to obtain a final detection result, and the specific steps are as follows:
A. training the training set by using the network constructed in the steps to obtain a network output model;
B. The output of the network is downsampled to obtain three multi-scale feature maps; each cell in a feature map predicts 3 bounding boxes, and each bounding box predicts three things: (1) the position of the box (4 values: the center coordinates t_x and t_y, and the box height t_h and width t_w); (2) one objectness prediction (confidence); (3) N class scores;
C. Bounding-box coordinate prediction: t_x, t_y, t_w and t_h are the raw predicted outputs of the model. c_x and c_y denote the coordinates of the grid cell; for example, if the feature map of a layer is 13×13 there are 13×13 cells, and the cell in row 0 and column 1 has c_x = 1 and c_y = 0. p_w and p_h denote the width and height of the prior (anchor) box. b_x, b_y, b_w and b_h are the center coordinates and size of the predicted bounding box. The coordinate loss adopts a squared-error loss;

b_x = δ(t_x) + c_x
b_y = δ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
Pr(object) · IOU(b, object) = δ(t_0)

where δ(·) denotes the sigmoid function.
D. Class prediction adopts multi-label classification: in a complex scene one object may belong to several categories, and the category labels of a detected result may contain two classes at the same time, so a logistic regression layer is needed to perform a binary classification for each class. The logistic regression layer mainly uses the sigmoid function, which constrains its input to the range 0 to 1; when the output of a feature-extracted image for a certain class is constrained by the sigmoid function, the object is judged to belong to that class if the result is greater than 0.5.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The above examples should be understood as illustrative only and not limiting the scope of the invention. Various changes and modifications to the present invention may be made by one skilled in the art after reading the teachings herein, and such equivalent changes and modifications are intended to fall within the scope of the invention as defined in the appended claims.

Claims (5)

1. The unmanned aerial vehicle aerial image target detection method based on YOLOv5 is characterized by comprising the following steps of:
step 1: dividing an unmanned aerial vehicle image data set into a training set and a testing set, preprocessing the training set and enhancing data to obtain a complete sample data set, and clustering through a K-means algorithm to obtain the size of an anchor frame;
step 2: based on the backbone network of YOLOv5, constructing a multi-context extraction module to extract features of the unmanned aerial vehicle image and enlarge the receptive field by means of dilated convolution and deformable convolution;
step 3: between the backbone network and the Neck layer, in order to utilize the shallow semantic information and make the network focus on dense areas, constructing a serial cross self-attention module for feature enhancement;
step 4: and obtaining a final model through complete training, and detecting the test picture by using the model, thereby obtaining a detection result.
2. The method for detecting the unmanned aerial vehicle aerial image target based on YOLOv5 according to claim 1, wherein the step 1 is characterized in that a training set is preprocessed and data is enhanced to obtain a complete sample data set, and an anchor is obtained through K-means algorithm clustering, and specifically comprises the following steps:
step 1.1, generating 1088 x 1088 pixel pictures of the pictures in the initial sample data set through scaling and stretching, and keeping the anchor frame proportion;
step 1.2, carrying out data enhancement on the picture data obtained in the step 1.1, and processing characteristic parameters of an object to be identified by adding sample data through operations of translation, rotation, saturation adjustment and exposure adjustment;
step 1.3, carrying out cluster analysis, through a K-means clustering algorithm, on the real target bounding boxes of the targets to be identified annotated in the sample data training set obtained in step 1.2; initializing 9 anchor boxes by randomly selecting 9 values from all the bounding boxes as initial values of the anchor boxes; the IoU (intersection over union) of each bounding box with each anchor box is calculated by:

IoU = (bounding box ∩ anchor box) / (bounding box ∪ anchor box)

where ∩ represents intersection and ∪ represents union;
then selecting the highest IoU value for each bounding box, and then taking the average over all the bounding boxes, namely the final accuracy value; finally, 9 refined anchor boxes are obtained as preset values of the network.
3. The method for detecting the target of the unmanned aerial vehicle aerial image based on YOLOv5 according to claim 2, wherein the step 2 is characterized in that a plurality of context extraction modules are constructed to extract the characteristics of the unmanned aerial vehicle image, and the method specifically comprises the following steps:
2.1, after the SPP (spatial pyramid pooling) module of the backbone network, respectively extracting features of the original feature map by using 3 groups of dilated convolutions, wherein the dilation rates are set to 1, 2 and 3 respectively; performing Concat operation merging on the obtained feature maps;
2.2, then using deformable convolution to correct the boundary information of the feature map obtained in step 2.1, wherein the specific method comprises: additionally building a convolution layer to learn offset information, and using the offsets to reposition the convolution sampling positions; finally, in order to keep the channel numbers the same, a 1x1 convolution is used for dimension reduction and a skip connection performs feature fusion.
4. The method for detecting the target of the unmanned aerial vehicle aerial image based on YOLOv5 according to claim 3, wherein the step 3 is constructed to perform feature extraction on the unmanned aerial vehicle image by series-connection cross self-attention, and specifically comprises the following steps:
3.1 on the Ultralytics YOLOv5 network model, serial cross self-attention is inserted between the backbone and the detection heads; firstly, feature extraction is carried out on the feature map with 3 cross convolutions; cross convolution emphasizes edge information by mining vertical and horizontal gradient information in parallel, and provides information enhancement for the subsequent self-attention; the cross convolution is built from two asymmetric filters, denoted 1×3 and 3×1; with F_in and F_out the input and output feature maps:

F_out = k_{1×3} * F_in + k_{3×1} * F_in

where k_{1×3} and k_{3×1} denote the two convolution kernels and * denotes convolution;
3.2 for each feature map I ∈ R^{C×H×W}, where C represents the number of channels, H the height and W the width of the feature map, three feature maps Q, K and V are first generated independently by cross convolution, wherein Q, K ∈ R^{C'×H×W}; C and C' both represent channel numbers, here C' = C/8; Q is decomposed along the two directions into Q_H of shape (B×W, H, C') and Q_W of shape (B×H, W, C'), i.e. the vertical and horizontal decompositions of the Q feature map; K is treated in the same way; the horizontal and vertical directions are respectively weighted to obtain A (Attention):

A = softmax(Q_H ⊗ K_H^T , Q_W ⊗ K_W^T)

where ⊗ denotes batched matrix multiplication along the decomposed dimension; after obtaining A (Attention) and V (Value), A is split into A_H of shape (B×W, H, H) and A_W of shape (B×H, W, W), the vertical and horizontal attention maps, and V is likewise split into V_H of shape (B×W, C, H) and V_W of shape (B×H, C, W); Out represents the final output feature map:

Out = V_H ⊗ A_H + V_W ⊗ A_W
Finally, the cross self-attention is connected in series to ensure that each point on the feature map can be associated with other points for calculation;
3.3 inputting the reinforced characteristics into three-scale YOLO detection heads, respectively corresponding to small, medium and large target objects, using the anchor boxes clustered in the step 1.3 as prior frames, and setting the number of predicted object categories;
3.4 up to now, the whole network frame is built.
5. The method for detecting the target of the aerial image of the unmanned aerial vehicle based on YOLOv5 according to claim 4, wherein the step 4 obtains a final model through complete training, and uses the model to carry out target detection on the test picture to obtain a final detection result, and the method specifically comprises the following steps:
training the training set by using the network constructed in the step 3 to obtain a network output model;
4.2, downsampling the output of the network to obtain three multi-scale feature maps, wherein each cell in the feature maps predicts 3 bounding boxes, and each bounding box predicts: (1) the position of the box (4 values: the center coordinates t_x and t_y, and the box height t_h and width t_w); (2) one objectness prediction (confidence); (3) N class scores;
4.3, bounding-box coordinate prediction: t_x, t_y, t_w and t_h are the raw predicted outputs of the model; c_x and c_y represent the coordinates of the grid cell, the cell in row 0 and column 1 having c_x = 1 and c_y = 0; p_w and p_h represent the width and height of the prior (anchor) box; b_x, b_y, b_w and b_h are the center coordinates and size of the predicted bounding box; the coordinate loss adopts a squared-error loss;

b_x = δ(t_x) + c_x
b_y = δ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
Pr(object) · IOU(b, object) = δ(t_0)

where δ(·) denotes the sigmoid function;
4.4, class prediction adopts multi-label classification, since the category labels of a detected result may contain two classes at the same time, so a logistic regression layer is needed to perform a binary classification for each class; the logistic regression layer mainly uses the sigmoid function, which constrains its input to the range 0 to 1, so that when the output of a feature-extracted image for a certain class is constrained by the sigmoid function, the object is judged to belong to that class if the result is greater than 0.5.
CN202211559260.3A 2022-12-06 2022-12-06 Yolov 5-based unmanned aerial vehicle aerial image target detection method Pending CN116091946A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211559260.3A CN116091946A (en) 2022-12-06 2022-12-06 Yolov 5-based unmanned aerial vehicle aerial image target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211559260.3A CN116091946A (en) 2022-12-06 2022-12-06 Yolov 5-based unmanned aerial vehicle aerial image target detection method

Publications (1)

Publication Number Publication Date
CN116091946A true CN116091946A (en) 2023-05-09

Family

ID=86212763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211559260.3A Pending CN116091946A (en) 2022-12-06 2022-12-06 Yolov 5-based unmanned aerial vehicle aerial image target detection method

Country Status (1)

Country Link
CN (1) CN116091946A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117611877A (en) * 2023-10-30 2024-02-27 西安电子科技大学 LS-YOLO network-based remote sensing image landslide detection method
CN117611877B (en) * 2023-10-30 2024-05-14 西安电子科技大学 LS-YOLO network-based remote sensing image landslide detection method
CN117456389A (en) * 2023-11-07 2024-01-26 西安电子科技大学 Improved unmanned aerial vehicle aerial image dense and small target identification method, system, equipment and medium based on YOLOv5s

Similar Documents

Publication Publication Date Title
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN112396002B (en) SE-YOLOv 3-based lightweight remote sensing target detection method
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
Wang et al. Multiscale visual attention networks for object detection in VHR remote sensing images
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN114202672A (en) Small target detection method based on attention mechanism
CN111783576B (en) Pedestrian re-identification method based on improved YOLOv3 network and feature fusion
Li et al. Adaptive deep convolutional neural networks for scene-specific object detection
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN111767882A (en) Multi-mode pedestrian detection method based on improved YOLO model
CN107633226B (en) Human body motion tracking feature processing method
CN112633382B (en) Method and system for classifying few sample images based on mutual neighbor
CN106257496B (en) Mass network text and non-textual image classification method
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
CN113052185A (en) Small sample target detection method based on fast R-CNN
CN105574545B (en) The semantic cutting method of street environment image various visual angles and device
CN111461006B (en) Optical remote sensing image tower position detection method based on deep migration learning
CN116051953A (en) Small target detection method based on selectable convolution kernel network and weighted bidirectional feature pyramid
CN112488229A (en) Domain self-adaptive unsupervised target detection method based on feature separation and alignment
Bai et al. Multimodal information fusion for weather systems and clouds identification from satellite images
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
CN113723558A (en) Remote sensing image small sample ship detection method based on attention mechanism
CN109241315A (en) A kind of fast face search method based on deep learning
CN117152625A (en) Remote sensing small target identification method, system, equipment and medium based on CoordConv and Yolov5

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination