CN116091946A - Yolov 5-based unmanned aerial vehicle aerial image target detection method - Google Patents

Yolov 5-based unmanned aerial vehicle aerial image target detection method Download PDF

Info

Publication number
CN116091946A
CN116091946A CN202211559260.3A CN202211559260A CN116091946A
Authority
CN
China
Prior art keywords
network
aerial vehicle
unmanned aerial
convolution
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211559260.3A
Other languages
Chinese (zh)
Inventor
周丽芳
王智峰
李伟生
肖明琪
王婧琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202211559260.3A priority Critical patent/CN116091946A/en
Publication of CN116091946A publication Critical patent/CN116091946A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/17Terrestrial scenes taken from planes or by drones
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Remote Sensing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a YOLOv5-based unmanned aerial vehicle aerial image target detection method, belonging to the technical field of target detection. The method comprises the following steps: step 1, the YOLOv5 algorithm is used as the basic model framework, and, in order to improve the accuracy of small-target detection in urban aerial images, a network that extracts multiple kinds of contextual features is designed; step 2, in order to improve the network's attention to dense areas, a serial cross self-attention module is added between the backbone network and each of the three detection heads to further enhance the information of dense regions; step 3, the final network model is obtained through iterative training and parameter updating, multi-scale prediction is used to improve small-target detection performance, and the final result is obtained through the three-scale detection-head predictions. The invention effectively alleviates the loss of context information, enhances the feature extraction capability, captures a more diverse feature space and achieves more accurate anchor-box positioning.

Description

YOLOv5-based unmanned aerial vehicle aerial image target detection method
Technical Field
The invention relates to the fields of computer vision and deep learning, and in particular to a YOLOv5-based unmanned aerial vehicle aerial image target detection method.
Background
With the rapid development of unmanned aerial vehicle (UAV) technology and deep learning, high-resolution, large-scale UAV image data have become increasingly abundant, and urban UAV images generally suffer from small targets, high resolution and uneven target distribution. Artificial neural networks are widely applied in UAV image target detection; most algorithms are based on prior (anchor) boxes and perform well on conventional datasets but only moderately on UAV images. UAV image target detection that balances detection speed and detection accuracy has therefore become a research hotspot in this field.
Object detection aims to find all objects of interest in an image and comprises two subtasks, object localization and object classification, i.e. determining the category and the position of each object simultaneously. Currently, widely used target detection methods fall mainly into two categories: One-stage and Two-stage. Two-stage methods are region-based algorithms that divide target detection into a detection stage and an identification stage: a region of interest is first found in the image by an algorithm or a network, and the targets within that region are then identified, e.g. RCNN and Fast-RCNN. One-stage methods are end-to-end algorithms that use a regression idea to directly generate the category probability and the position coordinates of the target, achieving detection and identification in a single pass, e.g. YOLO and SSD. One-stage methods have a speed advantage over Two-stage methods, but their accuracy is comparatively lower.
Because targets in UAV images suffer from a single imaging view angle, dense target distribution and large target-scale variation, directly applying natural-scene target detection methods to the UAV image target detection task does not give satisfactory results. At the same time, high resolution and large image sizes further increase the computational cost of an algorithm. In recent years, One-stage algorithms have become comparable to Two-stage algorithms in precision; the YOLO series is representative of One-stage algorithms, and YOLOv5 is a target detection network that balances speed and accuracy. Compared with the RCNN series of object detection methods, however, its localization accuracy is poorer and its recall rate lower. How to design an algorithm suitable for fast target detection in UAV images while improving the detection accuracy of small targets and objects in dense areas therefore remains a difficulty.
CN113807464B, an unmanned aerial vehicle aerial image target detection method based on improved YOLO V5, belongs to the fields of deep learning and target detection. In that method, a dataset is first constructed from UAV aerial images; the slicing layer in the Focus module of the YOLO V5 backbone network is replaced by a convolution layer; the image features are then further processed by the Neck part; next, to address the scattered target distribution and the very small target pixel ratio caused by the high-altitude aerial viewing angle, the large 76×255 detection head is removed from the network prediction layer and the anchor boxes are adjusted at the same time; finally, target detection performance is evaluated with generalized intersection over union, average precision and inference speed. The method achieves fast and accurate detection of UAV aerial image targets while improving recognition accuracy and feature extraction performance.
The CN113807464B patent does not take the context information in the image into account when improving the backbone network, and in the detection-head part it only removes the large detection head without optimizing the existing detection heads. In contrast, the present invention uses dilated convolution and deformable convolution to enlarge the receptive field when optimizing the backbone network, obtaining more comprehensive context information that helps to detect small targets. Meanwhile, in the detection-head part, the detection performance is enhanced by combining shallow backbone information with self-attention, so that the detection heads focus on regions that contain objects.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art and provides a YOLOv5-based unmanned aerial vehicle aerial image target detection method. The technical scheme of the invention is as follows:
an unmanned aerial vehicle aerial image target detection method based on YOLOv5 comprises the following steps:
step 1: dividing an unmanned aerial vehicle image dataset into a training set and a test set, preprocessing the training set and applying data enhancement to obtain a complete sample dataset, and clustering the complete sample dataset with the K-means algorithm to obtain the anchor box sizes;
step 2: based on the backbone network of YOLOv5, constructing a context extraction module to extract features of the unmanned aerial vehicle image and enlarge the receptive field by means of dilated convolution and deformable convolution;
step 3: between the backbone network and the Neck layer, in order to utilize the shallow semantic information and make the network focus on dense areas, constructing a serial cross self-attention module for feature enhancement;
step 4: and obtaining a final model through complete training, and detecting the test picture by using the model, thereby obtaining a detection result.
Further, in step 1 the training set is preprocessed and data-enhanced to obtain a complete sample dataset, which is clustered with the K-means algorithm to obtain the anchor boxes; this specifically comprises the following steps:
step 1.1, scaling and stretching the pictures in the initial sample dataset to 1088×1088 pixels while keeping the anchor-box proportions;
step 1.2, applying data enhancement to the picture data obtained in step 1.1: sample data are added through translation, rotation, saturation adjustment and exposure adjustment to enrich the characteristic parameters of the objects to be identified;
step 1.3, carrying out cluster analysis, with the K-means clustering algorithm, on the ground-truth bounding boxes of the targets to be identified annotated in the sample data training set obtained in step 1.2; 9 anchor boxes are initialized by randomly selecting 9 values from all bounding boxes as their initial values; the IoU (intersection over union) between each bounding box and each anchor box is calculated as:

IoU = (bounding box ∩ anchor box) / (bounding box ∪ anchor box)

where ∩ denotes the intersection and ∪ denotes the union.
The highest IoU value is then selected for each bounding box, and the average over all bounding boxes is taken, namely the final accuracy value; finally, 9 refined anchor boxes are obtained as preset values of the network, as in the sketch below.
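A minimal sketch of how steps 1.1 to 1.3 could be implemented is given below; the torchvision transforms, the augmentation strengths and the helper names are illustrative assumptions rather than the patent's implementation (the bounding-box annotations would also have to be transformed consistently with the images, which this sketch omits):

```python
import numpy as np
from torchvision import transforms

# Steps 1.1-1.2 (illustrative): resize every picture to 1088 x 1088 and augment by
# translation, rotation, saturation and exposure changes.
train_transforms = transforms.Compose([
    transforms.Resize((1088, 1088)),
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),
    transforms.ColorJitter(saturation=0.5, brightness=0.4),
    transforms.ToTensor(),
])

# Step 1.3: K-means clustering of ground-truth (w, h) pairs with a 1 - IoU distance.
def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs when both boxes are centred at the origin."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]      # random initialization
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes_wh, anchors), axis=1)            # nearest anchor by IoU
        new = np.array([boxes_wh[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    # "accuracy": mean of the best IoU of every ground-truth box with the final anchors
    accuracy = iou_wh(boxes_wh, anchors).max(axis=1).mean()
    return anchors[np.argsort(anchors.prod(axis=1))], accuracy
```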
Further, the step 2 of constructing a plurality of context extraction modules to extract features of the unmanned aerial vehicle image specifically includes:
2.1, after the SPP (spatial pyramid pooling) module of the backbone network, features are extracted from the original feature map with 3 groups of dilated convolutions, whose dilation rates are set to 1, 2 and 3 respectively; the obtained feature maps are merged with a Concat operation;
2.2, the boundary information of the feature map obtained in step 2.1 is then corrected with a deformable convolution, specifically: an additional convolution layer is built to learn offset information, and the offsets are used to reposition the convolution sampling positions; finally, to keep the channel numbers the same, a 1x1 convolution reduces the dimension and a skip connection performs feature fusion, as sketched below.
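For illustration only, a possible PyTorch realization of the context extraction module of steps 2.1 and 2.2 is sketched below; the module name, the exact channel handling and the use of torchvision's DeformConv2d with an extra offset-learning convolution are assumptions, not the patent's implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ContextExtraction(nn.Module):
    """Multi-context extraction sketch (steps 2.1-2.2): three dilated 3x3 convolutions
    with dilation rates 1, 2 and 3, Concat, a deformable convolution whose offsets are
    learned by an extra convolution layer, a 1x1 reduction and a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in (1, 2, 3)]
        )
        self.offset = nn.Conv2d(3 * channels, 18, 3, padding=1)   # 2 * 3 * 3 offsets per position
        self.deform = DeformConv2d(3 * channels, 3 * channels, 3, padding=1)
        self.reduce = nn.Conv2d(3 * channels, channels, 1)        # 1x1 reduction to the input channels

    def forward(self, x):
        ctx = torch.cat([branch(x) for branch in self.branches], dim=1)  # Concat of dilated features
        ctx = self.deform(ctx, self.offset(ctx))                         # boundary correction
        return x + self.reduce(ctx)                                      # skip connection

# usage sketch: y = ContextExtraction(512)(torch.randn(1, 512, 20, 20))
```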
Further, the step 3 of constructing a serial cross self-attention to perform feature extraction on the unmanned aerial vehicle image specifically includes:
3.1 On the Ultralytics YOLOv5 network model, serial cross self-attention is inserted between the backbone and the detection heads; first, 3 cross convolutions are used to extract features from the feature map; cross convolution emphasizes edge information by mining vertical and horizontal gradient information in parallel, providing information enhancement for the subsequent self-attention; the cross convolution is built from two asymmetric filters, denoted 1×3 and 3×1; with F_in and F_out the input and output feature maps:

F_out = k_{1×3} * F_in + k_{3×1} * F_in

where k_{1×3} and k_{3×1} denote the two convolution kernels and * denotes convolution.
3.2 For each feature map I ∈ R^{C×H×W} (C is the number of channels, H the height and W the width of the feature map), three feature maps Q, K and V are generated independently by cross convolution, where Q, K ∈ R^{C'×H×W}; C and C' both denote channel numbers, here C' = C/8. Q is decomposed along the two directions into Q_H of shape (B×W, H, C') and Q_W of shape (B×H, W, C'), i.e. the vertical and horizontal decompositions of the Q feature map; K is treated in the same way. The horizontal and vertical directions are then weighted respectively to obtain the attention A:

A = softmax(Q_H ⊗ K_H^T , Q_W ⊗ K_W^T)

where ⊗ denotes batched matrix multiplication along the decomposed dimension. After obtaining A (Attention) and V (Value), A is split into A_H of shape (B×W, H, H) and A_W of shape (B×H, W, W), the vertical and horizontal attention maps, and V is likewise split into V_H of shape (B×W, C, H) and V_W of shape (B×H, C, W). Out denotes the final output feature map:

Out = V_H ⊗ A_H + V_W ⊗ A_W

Finally, the cross self-attention is connected in series so that each point on the feature map can be associated with every other point in the computation (a code sketch of steps 3.1 and 3.2 is given after step 3.4);
3.3 The enhanced features are fed into YOLO detection heads at three scales, corresponding to small, medium and large targets respectively; the anchor boxes clustered in step 1.3 are used as prior boxes, and the number of predicted object categories is set;
3.4 At this point, the whole network framework has been built.
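The following PyTorch sketch illustrates the cross convolution of step 3.1 and a single cross self-attention pass of step 3.2; the per-direction softmax, the learnable residual weight gamma and all class and parameter names are assumptions made for illustration, and the serial connection is shown by simply applying the module twice:

```python
import torch
import torch.nn as nn

class CrossConv(nn.Module):
    """Cross convolution (step 3.1): two asymmetric filters, 1x3 and 3x1, applied in
    parallel and summed, mining horizontal and vertical gradient information."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.k1x3 = nn.Conv2d(in_ch, out_ch, (1, 3), padding=(0, 1))
        self.k3x1 = nn.Conv2d(in_ch, out_ch, (3, 1), padding=(1, 0))

    def forward(self, x):
        return self.k1x3(x) + self.k3x1(x)   # F_out = k_1x3 * F_in + k_3x1 * F_in

class CrossSelfAttention(nn.Module):
    """One cross self-attention pass (step 3.2): Q and K use C' = C // 8 channels,
    attention is computed separately along the vertical (H) and horizontal (W)
    directions, and the result is added back to the input feature map."""
    def __init__(self, channels):
        super().__init__()
        self.q = CrossConv(channels, channels // 8)
        self.k = CrossConv(channels, channels // 8)
        self.v = CrossConv(channels, channels)
        self.gamma = nn.Parameter(torch.zeros(1))      # learnable residual weight (assumption)

    def forward(self, x):
        B, C, H, W = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # vertical decomposition: every column is a sequence of length H -> Q_H, K_H, V_H
        q_h = q.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        k_h = k.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        v_h = v.permute(0, 3, 2, 1).reshape(B * W, H, -1)
        # horizontal decomposition: every row is a sequence of length W -> Q_W, K_W, V_W
        q_w = q.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        k_w = k.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        v_w = v.permute(0, 2, 3, 1).reshape(B * H, W, -1)
        a_h = torch.softmax(q_h @ k_h.transpose(1, 2), dim=-1)        # A_H: (B*W, H, H)
        a_w = torch.softmax(q_w @ k_w.transpose(1, 2), dim=-1)        # A_W: (B*H, W, W)
        out_h = (a_h @ v_h).reshape(B, W, H, C).permute(0, 3, 2, 1)   # back to (B, C, H, W)
        out_w = (a_w @ v_w).reshape(B, H, W, C).permute(0, 3, 1, 2)
        return x + self.gamma * (out_h + out_w)

# "serial" connection: two passes so every position can relate to every other position
# attn = CrossSelfAttention(256); y = attn(attn(torch.randn(1, 256, 40, 40)))
```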
Further, the step 4 obtains a final model through complete training, and uses the model to perform target detection on the test picture to obtain a final detection result, which specifically includes:
training the training set by using the network constructed in the step 3 to obtain a network output model;
4.2, the output of the network is downsampled to obtain three multi-scale feature maps; each cell in a feature map predicts 3 bounding boxes, and each bounding box predicts: (1) the position of the box (4 values: the center coordinates t_x and t_y, and the box height t_h and width t_w); (2) one objectness prediction (confidence); (3) N class scores;
4.3, bounding-box coordinate prediction: t_x, t_y, t_w and t_h are the raw predicted outputs of the model; c_x and c_y denote the coordinates of the grid cell (for example, the cell in row 0 and column 1 has c_x = 1 and c_y = 0); p_w and p_h denote the width and height of the prior (anchor) box; b_x, b_y, b_w and b_h are the center coordinates and size of the predicted bounding box; the coordinate loss adopts a squared-error loss;

b_x = δ(t_x) + c_x
b_y = δ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
Pr(object) · IOU(b, object) = δ(t_0)

where δ(·) denotes the sigmoid function.
4.4, class prediction adopts multi-label classification: the category labels of a detected result may contain two classes at the same time, so a logistic regression layer is used to perform a binary classification for each class; the logistic regression layer mainly uses the sigmoid function, which constrains its input to the range 0 to 1, so that when the output of a feature-extracted image for a certain class is constrained by the sigmoid function, the object is judged to belong to that class if the result is greater than 0.5.
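A decoding sketch for the step 4.3 equations and the step 4.4 sigmoid thresholding follows; the tensor layout, the scaling of box centers to pixels by the feature-map stride and the helper name decode_predictions are assumptions made for illustration:

```python
import torch

def decode_predictions(t, anchors, stride):
    """Decode one detection-head output t of shape (B, A, H, W, 5 + N):
    raw values are t_x, t_y, t_w, t_h, t_o followed by N class logits."""
    B, A, H, W, _ = t.shape
    cy, cx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((cx, cy), dim=-1).view(1, 1, H, W, 2).float()   # c_x, c_y per cell
    p_wh = anchors.view(1, A, 1, 1, 2)                                 # prior sizes p_w, p_h (pixels)

    xy = (torch.sigmoid(t[..., 0:2]) + grid) * stride    # b_x = sigmoid(t_x) + c_x, scaled to pixels
    wh = p_wh * torch.exp(t[..., 2:4])                   # b_w = p_w * e^t_w, b_h = p_h * e^t_h
    conf = torch.sigmoid(t[..., 4:5])                    # objectness = sigmoid(t_0)
    cls = torch.sigmoid(t[..., 5:])                      # independent sigmoid per class (multi-label)
    keep = cls > 0.5                                     # step 4.4: class assigned when score > 0.5
    return torch.cat((xy, wh, conf, cls), dim=-1), keep

# usage sketch (values are placeholders): one medium-scale head with 3 anchors and 10 classes
# boxes, labels = decode_predictions(torch.randn(1, 3, 68, 68, 15),
#                                    torch.tensor([[30., 61.], [62., 45.], [59., 119.]]), stride=16)
```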
The invention has the advantages and beneficial effects as follows:
the method mainly aims at the problem that in the current popular unmanned aerial vehicle image target detection task based on the deep convolutional neural network, the detection precision of images with uneven distribution on small targets and objects is not high; a YOLOv 5-based unmanned aerial vehicle target detection method added with various context information extraction modules and serial cross self-attention is provided. In the network structure design stage, selecting a YOLOv5 algorithm as a reference algorithm, and replacing the traditional convolution by utilizing the cavity convolution and the deformable convolution to extract various context characteristics and enlarge a receptive field; considering that the features extracted at the stage of the main network belong to shallow features, the features have rich semantic information, and the features at the neck are deep features, so that serial cross self-attention is added between the main network and the neck, and the neck features are effectively enhanced; the serial cross self-attention is further enhanced by calculating weight information in the transverse direction and the longitudinal direction, obtaining global features through serial connection and utilizing cross convolution; by taking the extracted features as the input of the YOLO detection head, three-scale prediction is carried out, so that the robustness of detection of a small target is enhanced. The method has good detection effect.
Drawings
FIG. 1 is a network framework of a method for detecting an image object of an unmanned aerial vehicle based on YOLOv5 according to a preferred embodiment of the present invention;
fig. 2 is a schematic diagram of a multiple context information extraction module according to the present invention.
FIG. 3 is a schematic illustration of serial cross self-attention of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and specifically described below with reference to the drawings in the embodiments of the present invention. The described embodiments are only a few embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the embodiment of the invention is based on a YOLOv5 target detection framework as a basic framework, and the detail is shown in https:// gitsub. The context extraction module formed by the cavity convolution and the deformable convolution is added behind the SPP module of the backbone network, so that the receptive field is enlarged. The attention mechanism combined by the cross-rolling and the self-attention is added between the backbone network and the neck, so that the attention of the network to the dense area is improved.
The invention is further described below with reference to the accompanying drawings:
As shown in fig. 1, the design flow of the network framework of the YOLOv5-based unmanned aerial vehicle image target detection method comprises the following steps:
A. The design is carried out on the Ultralytics YOLOv5 network model; the backbone of YOLOv5 comprises a Focus module, an SPP (spatial pyramid pooling) module and several CBS and C3 modules.
B. The multi-context information extraction module shown in fig. 2 is added after the SPP module at the end of the backbone network. First, feature extraction is performed on the original feature map with 3 groups of dilated convolutions, whose dilation rates are set to 1, 2 and 3 respectively. The resulting feature maps are merged with a Concat operation.
C. A deformable convolution is then used to correct the boundary information of the feature map obtained in the previous step: an additional convolution layer is built to learn offset information, and the offsets are used to reposition the convolution sampling positions. Finally, to keep the channel numbers the same, a 1x1 convolution reduces the dimension and a skip connection performs feature fusion.
Further, in order to fuse the shallow information of the backbone network with the deep information of the neck and make the network focus more on dense areas, serial cross self-attention is inserted between the backbone and the detection heads. The specific network flow is shown in fig. 3, and the implementation steps are as follows:
A. Feature extraction is performed on the feature map using 3 cross convolutions. Cross convolution emphasizes edge information by mining vertical and horizontal gradient information in parallel, providing information enhancement for the subsequent self-attention. The cross convolution is built from two asymmetric filters, denoted 1×3 and 3×1. With F_in and F_out the input and output feature maps:

F_out = k_{1×3} * F_in + k_{3×1} * F_in

where k_{1×3} and k_{3×1} denote the two convolution kernels and * denotes convolution.
B. For each feature map I ∈ R^{C×H×W}, three feature maps Q, K and V are first generated independently by cross convolution, where Q, K ∈ R^{C'×H×W}; C and C' both denote channel numbers, here C' = C/8 is set. Q is decomposed along the two directions into Q_H of shape (B×W, H, C') and Q_W of shape (B×H, W, C'); K is treated in the same way. The horizontal and vertical directions are then weighted respectively to obtain A (Attention):

A = softmax(Q_H ⊗ K_H^T , Q_W ⊗ K_W^T)

where ⊗ denotes batched matrix multiplication along the decomposed dimension. After obtaining A (Attention) and V (Value), A is split into A_H of shape (B×W, H, H) and A_W of shape (B×H, W, W), and V is likewise split into V_H of shape (B×W, C, H) and V_W of shape (B×H, C, W). The output feature map is:

Out = V_H ⊗ A_H + V_W ⊗ A_W
C. Finally we connect the cross self-attention in series to ensure that each point on the feature map can be associated with other points.
D. And inputting the reinforced features into three-scale YOLO detection heads, respectively corresponding to the small, medium and large target objects, using the anchor boxes clustered in 1.3 as prior frames, and setting the number of the predicted object categories.
Further, a final model is obtained through complete training, and the picture to be tested is detected by using the model to obtain a final detection result, and the specific steps are as follows:
A. training the training set by using the network constructed in the steps to obtain a network output model;
B. The output of the network is downsampled to obtain three multi-scale feature maps; each cell in a feature map predicts 3 bounding boxes, and each bounding box predicts three things: (1) the position of the box (4 values: the center coordinates t_x and t_y, and the box height t_h and width t_w); (2) one objectness prediction (confidence); (3) N class scores;
C. Bounding-box coordinate prediction: t_x, t_y, t_w and t_h are the raw predicted outputs of the model. c_x and c_y denote the coordinates of the grid cell; for example, if the feature map of a layer is 13×13 there are 13×13 cells, and the cell in row 0 and column 1 has c_x = 1 and c_y = 0. p_w and p_h denote the width and height of the prior (anchor) box. b_x, b_y, b_w and b_h are the center coordinates and size of the predicted bounding box. The coordinate loss adopts a squared-error loss;

b_x = δ(t_x) + c_x
b_y = δ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
Pr(object) · IOU(b, object) = δ(t_0)

where δ(·) denotes the sigmoid function.
D. Class prediction adopts multi-label classification: in a complex scene one object may belong to several categories, and the category labels of a detected result may contain two classes at the same time, so a logistic regression layer is needed to perform a binary classification for each class. The logistic regression layer mainly uses the sigmoid function, which constrains its input to the range 0 to 1; when the output of a feature-extracted image for a certain class is constrained by the sigmoid function, the object is judged to belong to that class if the result is greater than 0.5.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The above examples should be understood as illustrative only and not limiting the scope of the invention. Various changes and modifications to the present invention may be made by one skilled in the art after reading the teachings herein, and such equivalent changes and modifications are intended to fall within the scope of the invention as defined in the appended claims.

Claims (5)

1. The unmanned aerial vehicle aerial image target detection method based on YOLOv5 is characterized by comprising the following steps of:
step 1: dividing an unmanned aerial vehicle image data set into a training set and a testing set, preprocessing the training set and enhancing data to obtain a complete sample data set, and clustering through a K-means algorithm to obtain the size of an anchor frame;
step 2: based on the backbone network of YOLOv5, constructing a multi-context extraction module to extract features of the unmanned aerial vehicle image and enlarge the receptive field by means of dilated convolution and deformable convolution;
step 3: between the backbone network and the Neck layer, in order to utilize the shallow semantic information and make the network focus on dense areas, constructing a serial cross self-attention module for feature enhancement;
step 4: and obtaining a final model through complete training, and detecting the test picture by using the model, thereby obtaining a detection result.
2. The method for detecting the unmanned aerial vehicle aerial image target based on YOLOv5 according to claim 1, wherein the step 1 is characterized in that a training set is preprocessed and data is enhanced to obtain a complete sample data set, and an anchor is obtained through K-means algorithm clustering, and specifically comprises the following steps:
step 1.1, generating 1088 x 1088 pixel pictures of the pictures in the initial sample data set through scaling and stretching, and keeping the anchor frame proportion;
step 1.2, carrying out data enhancement on the picture data obtained in the step 1.1, and processing characteristic parameters of an object to be identified by adding sample data through operations of translation, rotation, saturation adjustment and exposure adjustment;
step 1.3, carrying out cluster analysis, through a K-means clustering algorithm, on the real target bounding boxes of the targets to be identified annotated in the sample data training set obtained in step 1.2; initializing 9 anchor boxes by randomly selecting 9 values from all the bounding boxes as initial values of the anchor boxes; the IoU (intersection over union) of each bounding box with each anchor box is calculated by:

IoU = (bounding box ∩ anchor box) / (bounding box ∪ anchor box)

where ∩ represents intersection and ∪ represents union;
then selecting the highest IoU value for each bounding box, and then taking the average over all the bounding boxes, namely the final accuracy value; finally, 9 refined anchor boxes are obtained as preset values of the network.
3. The method for detecting the target of the unmanned aerial vehicle aerial image based on YOLOv5 according to claim 2, wherein the step 2 is characterized in that a plurality of context extraction modules are constructed to extract the characteristics of the unmanned aerial vehicle image, and the method specifically comprises the following steps:
2.1, after the SPP (spatial pyramid pooling) module of the backbone network, respectively extracting features of the original feature map by using 3 groups of dilated convolutions, wherein the dilation rates are set to 1, 2 and 3 respectively; performing Concat operation merging on the obtained feature maps;
2.2, then using deformable convolution to correct the boundary information of the feature map obtained in step 2.1, wherein the specific method comprises: additionally building a convolution layer to learn offset information, and using the offsets to reposition the convolution sampling positions; finally, in order to keep the channel numbers the same, a 1x1 convolution is used for dimension reduction and a skip connection performs feature fusion.
4. The method for detecting the target of the unmanned aerial vehicle aerial image based on YOLOv5 according to claim 3, wherein the step 3 is constructed to perform feature extraction on the unmanned aerial vehicle image by series-connection cross self-attention, and specifically comprises the following steps:
3.1 on the Ultralytics YOLOv5 network model, serial cross self-attention is inserted between the backbone and the detection heads; firstly, feature extraction is carried out on the feature map with 3 cross convolutions; cross convolution emphasizes edge information by mining vertical and horizontal gradient information in parallel, and provides information enhancement for the subsequent self-attention; the cross convolution is built from two asymmetric filters, denoted 1×3 and 3×1; with F_in and F_out the input and output feature maps:

F_out = k_{1×3} * F_in + k_{3×1} * F_in

where k_{1×3} and k_{3×1} denote the two convolution kernels and * denotes convolution;
3.2 for each feature map I ∈ R^{C×H×W}, where C represents the number of channels, H the height and W the width of the feature map, three feature maps Q, K and V are first generated independently by cross convolution, wherein Q, K ∈ R^{C'×H×W}; C and C' both represent channel numbers, here C' = C/8; Q is decomposed along the two directions into Q_H of shape (B×W, H, C') and Q_W of shape (B×H, W, C'), i.e. the vertical and horizontal decompositions of the Q feature map; K is treated in the same way; the horizontal and vertical directions are respectively weighted to obtain A (Attention):

A = softmax(Q_H ⊗ K_H^T , Q_W ⊗ K_W^T)

where ⊗ denotes batched matrix multiplication along the decomposed dimension; after obtaining A (Attention) and V (Value), A is split into A_H of shape (B×W, H, H) and A_W of shape (B×H, W, W), the vertical and horizontal attention maps, and V is likewise split into V_H of shape (B×W, C, H) and V_W of shape (B×H, C, W); Out represents the final output feature map:

Out = V_H ⊗ A_H + V_W ⊗ A_W
Finally, the cross self-attention is connected in series to ensure that each point on the feature map can be associated with other points for calculation;
3.3 inputting the reinforced characteristics into three-scale YOLO detection heads, respectively corresponding to small, medium and large target objects, using the anchor boxes clustered in the step 1.3 as prior frames, and setting the number of predicted object categories;
3.4 up to now, the whole network frame is built.
5. The method for detecting the target of the aerial image of the unmanned aerial vehicle based on YOLOv5 according to claim 4, wherein the step 4 obtains a final model through complete training, and uses the model to carry out target detection on the test picture to obtain a final detection result, and the method specifically comprises the following steps:
training the training set by using the network constructed in the step 3 to obtain a network output model;
4.2, downsampling the output of the network to obtain three multi-scale feature maps, wherein each cell in the feature maps predicts 3 bounding boxes, and each bounding box predicts: (1) the position of the box (4 values: the center coordinates t_x and t_y, and the box height t_h and width t_w); (2) one objectness prediction (confidence); (3) N class scores;
4.3, bounding-box coordinate prediction: t_x, t_y, t_w and t_h are the raw predicted outputs of the model; c_x and c_y represent the coordinates of the grid cell, the cell in row 0 and column 1 having c_x = 1 and c_y = 0; p_w and p_h represent the width and height of the prior (anchor) box; b_x, b_y, b_w and b_h are the center coordinates and size of the predicted bounding box; the coordinate loss adopts a squared-error loss;

b_x = δ(t_x) + c_x
b_y = δ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
Pr(object) · IOU(b, object) = δ(t_0)

where δ(·) denotes the sigmoid function;
4.4, class prediction adopts multi-label classification, since the category labels of a detected result may contain two classes at the same time, so a logistic regression layer is needed to perform a binary classification for each class; the logistic regression layer mainly uses the sigmoid function, which constrains its input to the range 0 to 1, so that when the output of a feature-extracted image for a certain class is constrained by the sigmoid function, the object is judged to belong to that class if the result is greater than 0.5.
CN202211559260.3A 2022-12-06 2022-12-06 Yolov 5-based unmanned aerial vehicle aerial image target detection method Pending CN116091946A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211559260.3A CN116091946A (en) 2022-12-06 2022-12-06 Yolov 5-based unmanned aerial vehicle aerial image target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211559260.3A CN116091946A (en) 2022-12-06 2022-12-06 Yolov 5-based unmanned aerial vehicle aerial image target detection method

Publications (1)

Publication Number Publication Date
CN116091946A true CN116091946A (en) 2023-05-09

Family

ID=86212763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211559260.3A Pending CN116091946A (en) 2022-12-06 2022-12-06 Yolov 5-based unmanned aerial vehicle aerial image target detection method

Country Status (1)

Country Link
CN (1) CN116091946A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117611877A (en) * 2023-10-30 2024-02-27 西安电子科技大学 LS-YOLO network-based remote sensing image landslide detection method
CN117611877B (en) * 2023-10-30 2024-05-14 西安电子科技大学 LS-YOLO network-based remote sensing image landslide detection method
CN117456389A (en) * 2023-11-07 2024-01-26 西安电子科技大学 Improved unmanned aerial vehicle aerial image dense and small target identification method, system, equipment and medium based on YOLOv5s

Similar Documents

Publication Publication Date Title
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN112396002B (en) SE-YOLOv 3-based lightweight remote sensing target detection method
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
Wang et al. Multiscale visual attention networks for object detection in VHR remote sensing images
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN114202672A (en) Small target detection method based on attention mechanism
CN111783576B (en) Pedestrian re-identification method based on improved YOLOv3 network and feature fusion
Li et al. Adaptive deep convolutional neural networks for scene-specific object detection
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN111767882A (en) Multi-mode pedestrian detection method based on improved YOLO model
CN107633226B (en) Human body motion tracking feature processing method
CN112633382B (en) Method and system for classifying few sample images based on mutual neighbor
CN106257496B (en) Mass network text and non-textual image classification method
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
CN113052185A (en) Small sample target detection method based on fast R-CNN
CN105574545B (en) The semantic cutting method of street environment image various visual angles and device
CN111461006B (en) Optical remote sensing image tower position detection method based on deep migration learning
CN116051953A (en) Small target detection method based on selectable convolution kernel network and weighted bidirectional feature pyramid
CN112488229A (en) Domain self-adaptive unsupervised target detection method based on feature separation and alignment
Bai et al. Multimodal information fusion for weather systems and clouds identification from satellite images
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
CN113723558A (en) Remote sensing image small sample ship detection method based on attention mechanism
CN109241315A (en) A kind of fast face search method based on deep learning
CN117152625A (en) Remote sensing small target identification method, system, equipment and medium based on CoordConv and Yolov5

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination