CN114758255A - Unmanned aerial vehicle detection method based on YOLOV5 algorithm


Info

Publication number
CN114758255A
Authority
CN
China
Prior art keywords
detection
aerial vehicle
unmanned aerial
yolov5
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210350981.7A
Other languages
Chinese (zh)
Inventor
马峻
王晓
徐翠锋
陈寿宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-04-02
Filing date
2022-04-02
Publication date
2022-07-15
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202210350981.7A
Publication of CN114758255A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of target detection, and in particular to an unmanned aerial vehicle detection method based on the YOLOV5 algorithm. The method comprises: collecting unmanned aerial vehicle image data, and screening and labeling the image data to obtain a training set and a validation set; training a network model with the training set and the validation set to obtain a detection model; and detecting unmanned aerial vehicle video with the detection model to obtain a detection result. The detection model obtained by training on the training set and the validation set improves the detection accuracy for small targets such as unmanned aerial vehicles under complex backgrounds or in dense scenes, reduces the probability of false and missed detections, and thereby solves the problems that the existing unmanned aerial vehicle detection technology is insensitive to small targets and has large detection errors when targets are dense.

Description

Unmanned aerial vehicle detection method based on YOLOV5 algorithm
Technical Field
The invention relates to the technical field of target detection, in particular to an unmanned aerial vehicle detection method based on a YOLOV5 algorithm.
Background
Owing to their small size, agility and ease of operation, unmanned aerial vehicles are commonly applied to tasks such as target tracking and target searching, in which unmanned aerial vehicle detection technology is employed.
At present, existing unmanned aerial vehicle detection technology is mainly based on deep learning models and can be roughly divided into two categories. One is the two-stage target detection algorithm, which divides the detection problem into two stages: first generating candidate regions containing approximate position information of the target, then classifying the candidate regions and refining the positions. The other is the one-stage algorithm, which predicts object categories and positions directly in a single pass.
Methods of this kind are insensitive to small targets, and their detection error is large when targets are dense.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle detection method based on the YOLOV5 algorithm, so as to solve the problems that the existing unmanned aerial vehicle detection technology is insensitive to small targets and has large detection errors when targets are dense.
To this end, the invention provides an unmanned aerial vehicle detection method based on the YOLOV5 algorithm, which comprises the following steps:
collecting unmanned aerial vehicle image data, and screening and labeling the image data to obtain a training set and a validation set;
training a network model with the training set and the validation set to obtain a detection model;
and detecting the unmanned aerial vehicle video with the detection model to obtain a detection result.
The specific manner of collecting the unmanned aerial vehicle image data and of screening and labeling the image data is:
collecting unmanned aerial vehicle image data, and screening and labeling the image data to obtain a data set;
and dividing the data set into a training set and a validation set.
The network model comprises an Input end, a Backbone part, a Neck part and a Head part.
The specific manner of training the network model with the training set and the validation set to obtain the detection model is:
carrying out standardized preprocessing on the training set through the Input end to obtain a data-enhanced image;
extracting the features of the data-enhanced image through the Backbone part to obtain a feature map set;
obtaining tensor data of the feature map set through the Neck part based on the feature pyramid structure and feature fusion;
and calculating the gradient from the tensor data through the Head part to obtain a calculation result, and updating and verifying the gradient based on the calculation result and the validation set to obtain the YOLOV5 detection model.
The specific manner of calculating the gradient from the tensor data through the Head part to obtain the calculation result, and of updating and verifying the gradient based on the calculation result and the validation set to obtain the YOLOV5 detection model, is:
calculating the gradient through the Head part from the tensor data based on a loss function and back propagation to obtain a calculation result;
updating the gradient based on the calculation result to obtain an update state;
and verifying the update state with the validation set to obtain the evaluation-index results of the trained model and the YOLOV5 detection model.
The specific manner of detecting the unmanned aerial vehicle video with the detection model to obtain the detection result is:
splitting the unmanned aerial vehicle video into frames with the detection model, and detecting each frame image to obtain the detection result.
The unmanned aerial vehicle detection method based on the YOLOV5 algorithm collects unmanned aerial vehicle image data, and screens and labels the image data to obtain a training set and a validation set; trains a network model with the training set and the validation set to obtain a detection model; and detects the unmanned aerial vehicle video with the detection model to obtain a detection result. The detection model obtained by training on the training set and the validation set improves the detection accuracy for small targets such as unmanned aerial vehicles under complex backgrounds or in dense scenes and reduces the probability of false and missed detections, thereby solving the problems that the existing unmanned aerial vehicle detection technology is insensitive to small targets and has large detection errors when targets are dense.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of the unmanned aerial vehicle detection method based on the YOLOV5 algorithm provided by the invention.
FIG. 2 is a network structure diagram of the unmanned aerial vehicle detection method based on the YOLOV5 algorithm provided by the invention.
FIG. 3 is a structural diagram of C3 SwinTR.
FIG. 4 is a Swin Transformer block diagram.
FIG. 5 is a schematic diagram of the partitioning of regular and shifted windows.
FIG. 6 is a schematic diagram of the FPN structure and the PAN structure.
Detailed Description
Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or to elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.
Referring to FIGS. 1 to 6, the present invention provides a method for detecting an unmanned aerial vehicle based on the YOLOV5 algorithm, comprising the following steps:
s1, collecting unmanned aerial vehicle image data, and screening and labeling the image data to obtain a training set and a verification set;
the concrete method is as follows:
s11, collecting unmanned aerial vehicle image data, and screening and labeling the image data to obtain a data set;
s12 divides the data set into a training set and a validation set.
S2, training a network model with the training set and the validation set to obtain a detection model;
Specifically, the network model incorporates the Swin Transformer and comprises an Input end (Input), a backbone network part (Backbone), a feature fusion part (Neck) and a prediction output part (Head).
The specific manner of step S2 is:
s21, carrying out standardized preprocessing on the training set through the Input end to obtain a data enhanced image;
specifically, the Input end processes the Input image by using a Mosaic data enhancement mode, a self-adaptive anchor frame calculation mode and a self-adaptive picture scaling mode.
Mosaic data enhancement splices 4 pictures by random scaling, random cropping and random arrangement. Its advantages are enriching the data set and reducing the GPU memory required for training.
In the adaptive anchor-box calculation algorithm, each data set has anchor boxes with initially set length and width. During network training, the network outputs prediction boxes on the basis of the initial anchor boxes, compares them with the ground-truth boxes, calculates the gap between the two, and then updates the network parameters iteratively through back propagation. The initial anchor boxes are therefore an important component; in YOLOV5 the function for calculating the initial anchor-box values is embedded in the code, and the optimal anchor-box values for a given training set are calculated adaptively at each training run, as the clustering sketch below illustrates.
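The adaptive anchor-box calculation can be sketched as clustering the ground-truth box shapes of a training set. The plain k-means below is an illustrative stand-in for YOLOV5's built-in autoanchor routine (which additionally applies a genetic-evolution refinement); the choice of k = 9 anchors and the ratio-based distance are assumptions:

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster ground-truth (width, height) pairs into k anchor shapes.
    A simplified stand-in for YOLOV5's autoanchor; the ratio-based
    distance measures how well a box shape matches an anchor shape."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)].copy()
    for _ in range(iters):
        # distance = 1 - worst per-dimension size ratio to each anchor
        r = np.minimum(wh[:, None] / anchors[None], anchors[None] / wh[:, None])
        assign = (1 - r.min(axis=-1)).argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                anchors[j] = wh[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]  # sorted by area
```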
In common target detection algorithms, pictures differ in length and width, so the usual approach is to scale the original picture uniformly to a standard size and then feed it to the detection network. However, since many pictures have different aspect ratios, the black borders at the two ends differ in size after scaling and padding; if the padding is large, information redundancy arises and inference speed suffers. YOLOV5 adaptively adds the smallest possible black border to the original image, so the black borders at the two ends of the image height are reduced, the amount of computation during inference decreases, and the target detection speed improves; a minimal sketch of this "letterbox" scaling follows.
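The sketch below assumes a 640-pixel target size and the gray padding value 114, which follow common YOLOV5 practice and are assumptions rather than text from the patent:

```python
import cv2

def letterbox(img, new_size=640, pad_value=114):
    """Resize while keeping the aspect ratio, then pad the shorter side
    with the minimal border so the output is new_size x new_size."""
    h, w = img.shape[:2]
    scale = min(new_size / h, new_size / w)
    nh, nw = round(h * scale), round(w * scale)
    resized = cv2.resize(img, (nw, nh))
    top = (new_size - nh) // 2
    left = (new_size - nw) // 2
    return cv2.copyMakeBorder(resized, top, new_size - nh - top,
                              left, new_size - nw - left,
                              cv2.BORDER_CONSTANT,
                              value=(pad_value, pad_value, pad_value))
```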
Through Mosaic data enhancement, 4 pictures in the training set are spliced by random scaling, random cropping and random arrangement to obtain a spliced picture; the optimal anchor-box values of the spliced picture are calculated by the adaptive anchor-box calculation algorithm to obtain a preprocessed picture; and the preprocessed picture is scaled and padded to obtain the data-enhanced image. A sketch of the Mosaic stitch follows.
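A minimal sketch of the four-picture Mosaic stitch; the random center point is an assumption, and the remapping of bounding-box labels is omitted for brevity:

```python
import random
import cv2
import numpy as np

def mosaic4(imgs, size=640):
    """Stitch 4 images into one mosaic canvas around a random center;
    label (bounding-box) remapping is omitted for brevity."""
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)
    cx = random.randint(size // 4, 3 * size // 4)
    cy = random.randint(size // 4, 3 * size // 4)
    regions = [(0, 0, cx, cy), (cx, 0, size, cy),
               (0, cy, cx, size), (cx, cy, size, size)]
    for img, (x1, y1, x2, y2) in zip(imgs, regions):
        # random scaling/cropping reduces here to resizing into the cell
        canvas[y1:y2, x1:x2] = cv2.resize(img, (x2 - x1, y2 - y1))
    return canvas
```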
S22, performing feature extraction on the data-enhanced image through the Backbone part to obtain a feature map set;
Specifically, the Backbone part extracts rich information features from the input image and mainly comprises two structures, convolution and C3. C3 is a simplified BottleneckCSP: apart from the Bottleneck part it contains only 3 convolutions, which reduces parameters. The invention replaces the Bottleneck with a Swin Transformer block and combines it with the C3 structure, naming the result C3 SwinTR. Introducing a self-attention module into the backbone network lets the network attend better to global information and rich context information, and thus extract the features of a target better.
A Swin Transformer block consists of a left part and a right part, which use W-MSA and SW-MSA alternately, each followed by a two-layer Multilayer Perceptron (MLP); a layer normalization unit (LayerNorm, LN) is applied before each multi-head self-attention (MSA) module and before each MLP structure.
The Swin Transformer block is computed as follows:
$$\hat{z}^{l} = \mathrm{W\text{-}MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1}$$
$$z^{l} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l}$$
$$\hat{z}^{l+1} = \mathrm{SW\text{-}MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l}$$
$$z^{l+1} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}$$
where $\hat{z}^{l}$ and $z^{l}$ denote the output features of the (S)W-MSA module and of the MLP module of block $l$, respectively ($l$ indexes the layer); W-MSA and SW-MSA denote window-based multi-head self-attention under the regular-window and shifted-window partitioning configurations, respectively.
The shifted windows of SW-MSA bridge adjacent windows of the previous layer, a connection that is very useful for image classification, object detection and semantic segmentation. The partitioning of regular and shifted windows is shown in FIG. 5, where the left diagram shows the regular windows and the right diagram the shifted windows; a sketch of the window mechanics follows this paragraph.
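The window mechanics can be sketched in PyTorch as follows; the window size of 7 and the torch.roll-based cyclic shift follow the public Swin Transformer reference implementation and are assumptions about this patent's internals:

```python
import torch

def window_partition(x, ws=7):
    """Split a (B, H, W, C) feature map into non-overlapping ws x ws
    windows, returning (num_windows * B, ws * ws, C) token groups."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def shift_windows(x, ws=7):
    """Cyclic shift by ws // 2 so that SW-MSA windows straddle the
    borders of the previous layer's regular windows."""
    return torch.roll(x, shifts=(-(ws // 2), -(ws // 2)), dims=(1, 2))

x = torch.randn(1, 56, 56, 96)                       # toy feature map
w_msa_tokens = window_partition(x)                   # regular (W-MSA)
sw_msa_tokens = window_partition(shift_windows(x))   # shifted (SW-MSA)
```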
S23, obtaining tensor data of the feature map set through the Neck part based on the feature pyramid structure and feature fusion;
Specifically, the Neck part mainly generates a feature pyramid from the information features. The feature pyramid strengthens the model's detection of objects at different scales, so that the same object can be recognized at different sizes. The Neck part combines an FPN structure with a PAN structure to blend the image features and pass the feature maps on to the prediction layer. The FPN and PAN structures are shown in FIG. 6, with the FPN on the left and the PAN on the right.
The FPN structure establishes a top-down path for feature fusion and then makes predictions with the fused feature layers, which carry richer semantic information, but the structure is limited by its unidirectional information flow;
the PAN structure adds a bottom-up path on the basis of the FPN and passes the position information of the bottom layers up to the prediction feature layers, so that each prediction feature layer carries both the semantic information of the top layers and the position information of the bottom layers, which can greatly improve target detection precision. An illustrative fusion sketch follows.
S24, calculating the gradient from the tensor data through the Head part to obtain a calculation result, and updating and verifying the gradient based on the calculation result and the validation set to obtain the YOLOV5 detection model.
The specific manner is as follows:
S241, calculating the gradient through the Head part from the tensor data based on a loss function and back propagation to obtain a calculation result;
Specifically, the Head part makes predictions on the image features from the feature pyramid, producing bounding boxes and predicted categories. The loss function of a target detection task generally consists of a classification loss function and a bounding-box regression loss function;
CIOU_Loss is adopted as the loss function of the bounding box.
The formula of CIOU_Loss is shown in equation (1):
$$\mathrm{CIOU\_Loss} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v \tag{1}$$
where $\rho$ denotes the Euclidean distance between the two center points $b$ and $b^{gt}$ (the centers of the prediction box and of the ground-truth box), $c$ denotes the diagonal length of the minimum closure region that simultaneously contains the prediction box and the ground-truth box, $\alpha$ is a positive trade-off weight, and $v$ is a parameter measuring aspect-ratio consistency, given by equation (2):
$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2} \tag{2}$$
where $w^{gt}$ and $h^{gt}$ denote the width and height of the ground-truth box, and $w$ and $h$ denote the width and height of the prediction box. CIOU_Loss thus takes into account the three important geometric factors of target-box regression: overlap area, center-point distance and aspect ratio. A sketch implementation follows.
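A direct PyTorch implementation of equations (1) and (2) as a sketch; the (x1, y1, x2, y2) box format is an assumption:

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIOU loss per equations (1)-(2) for (N, 4) boxes in
    (x1, y1, x2, y2) format; returns a per-box loss tensor."""
    # overlap area -> IoU
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # rho^2: squared distance between box centers; c^2: squared diagonal
    # of the minimum closure region containing both boxes
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2
            + (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # v: aspect-ratio consistency (eq. 2); alpha: its trade-off weight
    v = (4 / math.pi ** 2) * (
        torch.atan((target[:, 2] - target[:, 0]) / (target[:, 3] - target[:, 1] + eps))
        - torch.atan((pred[:, 2] - pred[:, 0]) / (pred[:, 3] - pred[:, 1] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```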
S242, updating the gradient based on the calculation result to obtain an update state;
S243, verifying the update state with the validation set: if the verification succeeds, the detection model is obtained; if it fails, the process returns to step S11 and unmanned aerial vehicle image data are collected again.
S3, detecting the unmanned aerial vehicle video by using the detection model to obtain a detection result.
The specific manner is as follows: the detection model splits the unmanned aerial vehicle video into frames and detects each frame image to obtain a number of target boxes; the target boxes are then screened to obtain the detection result. For the screening of the target boxes, redundant detection boxes are removed by an NMS operation using the IOU as the threshold, yielding the final correct detection result. A sketch of this inference loop follows.
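A sketch of the frame-by-frame inference loop; the torch.hub entry point, the weight file name and the thresholds are illustrative assumptions rather than part of the patent:

```python
import cv2
import torch

# Load a trained YOLOv5 model via torch.hub (assumed weight file).
model = torch.hub.load("ultralytics/yolov5", "custom", path="uav_best.pt")
model.conf = 0.25   # confidence threshold (assumed)
model.iou = 0.45    # IoU threshold used by the built-in NMS (assumed)

cap = cv2.VideoCapture("uav_video.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # detect on each frame; the model's post-processing applies NMS,
    # discarding redundant boxes above the IoU threshold
    results = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    for *xyxy, conf, cls in results.xyxy[0].tolist():
        x1, y1, x2, y2 = map(int, xyxy)
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
cap.release()
```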
Conv in FIG. 2: a convolution layer;
C3: a structure consisting of a Bottleneck part and 3 convolutions;
C3 SwinTR: the structure obtained by replacing the Bottleneck part in C3 with a Swin Transformer block;
SPPF: a spatial pyramid pooling layer;
UpSample: upsampling;
Concat: feature fusion;
NMS: non-maximum suppression.
In FIG. 3, [bs, c1, w, h], [bs, c2, w, h], [bs, c2/2, w, h]: parameters of the input feature maps;
bs (batch size): the number of samples trained at one time, which influences the degree and speed of model optimization;
c1: the number of input channels of the whole C3 SwinTR;
c2: the number of output channels of the whole C3 SwinTR;
c2/2: half of the number of output channels of the whole C3 SwinTR;
w (width), h (height): the width and height of the feature map;
Conv: convolution;
cv1, cv2, cv3: the 1st, 2nd and 3rd convolution blocks;
k (kernel): convolution kernel size;
s (stride): step size;
Concat: feature-map splicing.
In layer l (left) in FIG. 4, a regular window partitioning scheme is adopted, and self-attention is computed within each window. In the next layer l+1 (right), the window partitioning is shifted, producing new windows. The self-attention computation in the new windows crosses the boundaries of the previous windows in layer l, providing connections between them.
local windows self-attention: a window within which self-attention is performed;
patch: an image block;
zl in FIG. 5 denotes the features of layer l.
Although the invention has been described above with reference to a preferred embodiment based on the YOLOV5 algorithm, it should be understood that the scope of the invention is not limited thereto; those skilled in the art may implement all or part of the process flow of the above embodiment in equivalent ways, and all equivalent changes made within the scope of the claims remain within the scope of the invention.

Claims (6)

1. An unmanned aerial vehicle detection method based on a YOLOV5 algorithm is characterized by comprising the following steps:
acquiring image data of an unmanned aerial vehicle, and screening and labeling the image data to obtain a training set and a validation set;
training a network model with the training set and the validation set to obtain a detection model;
and detecting the unmanned aerial vehicle video by using the detection model to obtain a detection result.
2. The unmanned aerial vehicle detection method based on the YOLOV5 algorithm of claim 1, wherein
the specific manner of acquiring the unmanned aerial vehicle image data and of screening and labeling the image data is:
acquiring unmanned aerial vehicle image data, and screening and labeling the image data to obtain a data set;
the data set is divided into a training set and a validation set.
3. The unmanned aerial vehicle detection method based on the YOLOV5 algorithm of claim 1, wherein
the network model comprises an Input end, a Backbone part, a Neck part and a Head part.
4. The unmanned aerial vehicle detection method based on the YOLOV5 algorithm of claim 3, wherein
the specific manner of training the network model with the training set and the validation set to obtain the detection model is as follows:
carrying out standardized preprocessing on the training set through the Input end to obtain a data enhanced image;
performing feature extraction on the data-enhanced image through the Backbone part to obtain a feature map set;
obtaining tensor data of the feature map set through the Neck part based on the feature pyramid structure and feature fusion;
and calculating the gradient from the tensor data through the Head part to obtain a calculation result, and updating and verifying the gradient based on the calculation result and the validation set to obtain the YOLOV5 detection model.
5. The unmanned aerial vehicle detection method based on the YOLOV5 algorithm of claim 4, wherein
the specific manner of calculating the gradient from the tensor data through the Head part to obtain the calculation result, and of updating and verifying the gradient based on the calculation result and the validation set to obtain the YOLOV5 detection model, is as follows:
calculating the gradient through the Head part from the tensor data based on a loss function and back propagation to obtain a calculation result;
updating the gradient based on the calculation result to obtain an update state;
and verifying the update state with the validation set to obtain the evaluation-index results of the trained model and the YOLOV5 detection model.
6. The unmanned aerial vehicle detection method based on the YOLOV5 algorithm of claim 1, wherein
the specific manner of detecting the unmanned aerial vehicle video with the detection model to obtain the detection result is as follows:
splitting the unmanned aerial vehicle video into frames with the detection model, and detecting each frame image to obtain the detection result.
CN202210350981.7A 2022-04-02 2022-04-02 Unmanned aerial vehicle detection method based on YOLOV5 algorithm Pending CN114758255A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210350981.7A CN114758255A (en) 2022-04-02 2022-04-02 Unmanned aerial vehicle detection method based on YOLOV5 algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210350981.7A CN114758255A (en) 2022-04-02 2022-04-02 Unmanned aerial vehicle detection method based on YOLOV5 algorithm

Publications (1)

Publication Number Publication Date
CN114758255A (en) 2022-07-15

Family

ID=82330061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210350981.7A Pending CN114758255A (en) 2022-04-02 2022-04-02 Unmanned aerial vehicle detection method based on YOLOV5 algorithm

Country Status (1)

Country Link
CN (1) CN114758255A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152591A (en) * 2022-11-25 2023-05-23 中山大学 Model training method, infrared small target detection method and device and electronic equipment
CN116152591B (en) * 2022-11-25 2023-11-07 中山大学 Model training method, infrared small target detection method and device and electronic equipment
CN116152685A (en) * 2023-04-19 2023-05-23 武汉纺织大学 Pedestrian detection method and system based on unmanned aerial vehicle visual field
CN116704317A (en) * 2023-08-09 2023-09-05 深圳华付技术股份有限公司 Target detection method, storage medium and computer device
CN116704317B (en) * 2023-08-09 2024-04-19 深圳华付技术股份有限公司 Target detection method, storage medium and computer device

Similar Documents

Publication Publication Date Title
CN114758255A (en) Unmanned aerial vehicle detection method based on YOLOV5 algorithm
CN112884064A (en) Target detection and identification method based on neural network
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN115496752B (en) Steel surface defect detection method based on one-stage target detection algorithm
CN112149547A (en) Remote sensing image water body identification based on image pyramid guidance and pixel pair matching
CN112801270B (en) Automatic U-shaped network slot identification method integrating depth convolution and attention mechanism
CN114399672A (en) Railway wagon brake shoe fault detection method based on deep learning
CN112633149B (en) Domain-adaptive foggy-day image target detection method and device
CN115035361A (en) Target detection method and system based on attention mechanism and feature cross fusion
CN114648665A (en) Weak supervision target detection method and system
CN116342894B (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN113052834A (en) Pipeline defect detection method based on convolution neural network multi-scale features
CN111753732A (en) Vehicle multi-target tracking method based on target center point
CN113313706A (en) Power equipment defect image detection method based on detection reference point offset analysis
CN115830471A (en) Multi-scale feature fusion and alignment domain self-adaptive cloud detection method
CN116597411A (en) Method and system for identifying traffic sign by unmanned vehicle in extreme weather
CN113920479A (en) Target detection network construction method, target detection device and electronic equipment
CN113378727A (en) Remote sensing image binary change detection method based on characteristic deviation alignment
CN115294176B (en) Double-light multi-model long-time target tracking method and system and storage medium
CN117095155A (en) Multi-scale nixie tube detection method based on improved YOLO self-adaptive attention-feature enhancement network
CN114494893B (en) Remote sensing image feature extraction method based on semantic reuse context feature pyramid
CN116310293A (en) Method for detecting target of generating high-quality candidate frame based on weak supervised learning
CN114937239A (en) Pedestrian multi-target tracking identification method and tracking identification device
CN115035429A (en) Aerial photography target detection method based on composite backbone network and multiple measuring heads

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination