CN115410100A - Small target detection method and system based on unmanned aerial vehicle image - Google Patents


Info

Publication number
CN115410100A
CN115410100A (application CN202210881394.0A)
Authority
CN
China
Prior art keywords
feature
target
image
feature maps
aerial vehicle
Prior art date
Legal status
Withdrawn
Application number
CN202210881394.0A
Other languages
Chinese (zh)
Inventor
朱敦尧 (Zhu Dunyao)
余雄风 (Yu Xiongfeng)
Current Assignee
Wuhan Kotei Informatics Co Ltd
Original Assignee
Wuhan Kotei Informatics Co Ltd
Priority date
Filing date
Publication date
Application filed by Wuhan Kotei Informatics Co Ltd filed Critical Wuhan Kotei Informatics Co Ltd
Priority: CN202210881394.0A
Publication: CN115410100A
Legal status: Withdrawn

Classifications

    • G06V 20/17 — Terrestrial scenes taken from planes or by drones
    • G06N 3/084 — Learning methods: backpropagation, e.g. using gradient descent
    • G06V 10/245 — Aligning, centring, orientation detection or correction of the image by locating a pattern; special marks for positioning
    • G06V 10/40 — Extraction of image or video features
    • G06V 10/762 — Recognition or understanding using pattern recognition or machine learning: clustering
    • G06V 10/806 — Fusion of extracted features
    • G06V 10/82 — Recognition or understanding using neural networks
    • G06V 2201/07 — Indexing scheme: target detection

Abstract

The invention provides a small target detection method and system based on unmanned aerial vehicle images, wherein the method comprises the following steps: acquiring an image captured by an unmanned aerial vehicle, preprocessing the image, and computing anchor values for targets of different sizes in the image via a clustering algorithm; extracting feature maps at different resolutions through a CSP-Darknet53 backbone feature extraction model; performing feature enhancement on the feature maps based on a Swin-Transformer structure and an SPPF structure; fusing feature information from different network layers, based on the multi-resolution feature maps and the enhanced feature maps, to extract target features; and detecting and outputting target position and category information from the extracted target features. This scheme reduces missed detections of small targets in unmanned aerial vehicle images and improves small-target detection accuracy.

Description

Small target detection method and system based on unmanned aerial vehicle image
Technical Field
The invention belongs to the field of image processing, and particularly relates to a small target detection method and system based on an unmanned aerial vehicle image.
Background
With the rapid development of unmanned aerial vehicle image acquisition technology, applications based on unmanned aerial vehicle image processing have matured. Compared with ordinary images, targets in an unmanned aerial vehicle's view-angle image are small, numerous, and densely packed; constrained by target type and the flight altitude of the unmanned aerial vehicle, small target detection based on unmanned aerial vehicle images is difficult.
Currently, mainstream unmanned aerial vehicle image target detection falls into two modes, single-stage and two-stage. Typical single-stage algorithms include SSD and YOLO; typical two-stage algorithms include Fast R-CNN and Faster R-CNN. Two-stage algorithms outperform single-stage algorithms in accuracy because they first generate candidate boxes and then perform classification and regression. However, small targets are affected by shooting distance, have few pixels, image blurrily, and are limited by the prediction-box size, so the accuracy of small-target detection in unmanned aerial vehicle images is low and missed detections occur easily.
Disclosure of Invention
In view of this, embodiments of the invention provide a small target detection method and system based on unmanned aerial vehicle images, intended to address the low small-target detection accuracy of existing unmanned aerial vehicle image recognition.
In a first aspect of the embodiments of the present invention, a small target detection method based on an unmanned aerial vehicle image is provided, including:
acquiring an image acquired by an unmanned aerial vehicle, preprocessing the image, and calculating anchor values of targets with different sizes in the image through a clustering algorithm;
extracting feature maps with different resolutions through a CSP-Darknet53 backbone feature extraction model;
performing feature reinforcement on the feature map based on a swin-Transformer structure and an SPPF structure;
fusing different network layer feature information to extract target features based on feature maps with different resolutions and feature maps with enhanced features;
and detecting and outputting the position and the category information of the target according to the extracted target characteristics.
In a second aspect of the embodiments of the present invention, there is provided a small target detection system based on an unmanned aerial vehicle image, including:
the preprocessing module is used for receiving the images acquired by the unmanned aerial vehicle, preprocessing the images and calculating anchor values of targets with different sizes in the images through a clustering algorithm;
the feature extraction module is used for extracting feature maps with different resolutions through a CSP-Darknet53 backbone feature extraction model;
the characteristic enhancement module is used for carrying out characteristic enhancement on the characteristic diagram based on the swin-Transformer structure and the SPPF structure;
the enhanced feature extraction module is used for fusing different network layer feature information to extract target features based on feature maps with different resolutions and feature maps with enhanced features;
and the target prediction module is used for detecting and outputting the position and the category information of the target according to the extracted target characteristics.
In a third aspect of the embodiments of the present invention, there is provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable by the processor, where the processor executes the computer program to implement the steps of the method according to the first aspect of the embodiments of the present invention.
In a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is provided, which stores a computer program, which when executed by a processor implements the steps of the method provided by the first aspect of the embodiments of the present invention.
In the embodiment of the invention, a Swin-Transformer structure and an SPPF structure are added on top of the CSP-Darknet53 backbone feature extraction model for feature enhancement, which improves the accuracy of small target detection and recognition in unmanned aerial vehicle images, reduces missed small targets during target extraction, and guarantees target detection precision.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a small target detection method based on an unmanned aerial vehicle image according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating connection between a swin-Transformer structure and an SPPF structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the small target detection effect according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a small target detection system based on an image of an unmanned aerial vehicle according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification or claims and in the accompanying drawings, are intended to cover a non-exclusive inclusion, such that a process, method or system, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements. In addition, "first" and "second" are used to distinguish different objects, and are not used to describe a specific order.
Referring to fig. 1, a schematic flow chart of a small target detection method based on an unmanned aerial vehicle image according to an embodiment of the present invention includes:
s101, obtaining an image acquired by an unmanned aerial vehicle, preprocessing the image, and calculating anchor values of targets with different sizes in the image through a clustering algorithm;
the method comprises the steps of obtaining an image collected by the unmanned aerial vehicle, wherein the image generally comprises specific imaging targets such as people, vehicles and the like. And preprocessing an original image acquired by the unmanned aerial vehicle, marking the positions of small targets in the image, and adding a label.
Wherein the preprocessing comprises at least the following:
scaling the image within a predetermined interval, and applying mosaic data augmentation to the scaled image.
The predetermined interval is a scaling range set according to actual needs within which the image can be enlarged or reduced; here the interval is 0.5-1.5, i.e., the original image can be enlarged to 1.5 times or reduced to 0.5 times.
Mosaic data augmentation randomly crops the selected picture together with 3 other random pictures and splices the crops into one picture used as training data, which enriches picture backgrounds and strengthens network robustness.
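The mosaic step described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the output size, the quadrant split, and the crop policy are assumptions, and a real pipeline would also remap the box labels of each crop.

```python
import numpy as np

def mosaic(images, out_size=640, seed=0):
    """Sketch of mosaic augmentation: crop 4 images and tile them
    into the four quadrants of a single training image."""
    rng = np.random.default_rng(seed)
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    # A random split point defines the four quadrant sizes.
    cx = rng.integers(out_size // 4, 3 * out_size // 4)
    cy = rng.integers(out_size // 4, 3 * out_size // 4)
    quads = [(0, cy, 0, cx), (0, cy, cx, out_size),
             (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y0, y1, x0, x1) in zip(images, quads):
        h, w = y1 - y0, x1 - x0
        # Random top-left crop of the needed size (assumes img is large enough).
        ys = rng.integers(0, img.shape[0] - h + 1)
        xs = rng.integers(0, img.shape[1] - w + 1)
        canvas[y0:y1, x0:x1] = img[ys:ys + h, xs:xs + w]
    return canvas
```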
For the preprocessed image, a set of target boxes is computed via a clustering algorithm, where the anchor value is the width and height of the target box matched to a target. The clustering algorithm can be k-means or a similar method.
Targets can be classified by target-box size, e.g., into small targets and conventional targets, and conventional targets can be further divided into small, medium, and large, yielding four size classes in total.
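The anchor computation can be sketched with a plain k-means over labelled box widths and heights. This is an illustrative version: the distance metric, cluster count, and iteration budget are assumptions (YOLO-style implementations often cluster with a 1 − IoU distance rather than Euclidean distance).

```python
import numpy as np

def kmeans_anchors(wh, k=4, iters=50, seed=0):
    """Sketch of anchor computation: k-means on labelled box (width, height)
    pairs; each cluster centre becomes one anchor (w, h)."""
    rng = np.random.default_rng(seed)
    wh = np.asarray(wh, dtype=float)
    centers = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # Assign each box to its nearest centre (Euclidean in (w, h) space).
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(axis=0)
    # Sort by box area so anchors run from small to large targets.
    return centers[np.argsort(centers.prod(axis=1))]
```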
S102, extracting feature maps with different resolutions through a CSP-Darknet53 backbone feature extraction model;
the CSP-Darknet53 backbone feature extraction model is a backbone network in YOLOv4 and can be used for extracting image features, and CSPs (Cross-Stage-Partial-connections) are connected in a Cross-Stage mode.
Feature maps with different high and low resolutions can be extracted through a CSP-Darknet53 backbone feature extraction model.
S103, performing feature reinforcement on the feature map based on the swin-Transformer structure and the SPPF structure;
the swin-Transformer structure is composed of a multi-head attention module and comprises W-MSA and SW-MSA structures, and the W-MSA and the SW-MSA learn characteristic values in a window through a fixed window and an offset window respectively. The swin-Transformer V2 is an improvement on the swin-Transformer structure, and the parameter capacity, the window resolution and the like are enlarged.
The SPPF structure connects and cascades the input feature maps in parallel, carries out stacking processing of channel dimensions, and extracts feature map features through a CBL (Conv + BN + LeakyRelu) module.
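The SPPF stacking can be illustrated in NumPy as three cascaded same-size max-poolings concatenated with the input along the channel axis. The kernel size and the omission of the CBL convolutions here are simplifications, not the patent's exact module.

```python
import numpy as np

def maxpool2d_same(x, k):
    """Stride-1 max pooling with 'same' padding on a float (H, W, C) array."""
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), constant_values=-np.inf)
    H, W, C = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = xp[i:i + k, j:j + k].max(axis=(0, 1))
    return out

def sppf(x, k=5):
    """Sketch of SPPF: three cascaded max-poolings of one kernel size;
    the input and the three pooled maps are stacked along the channel
    dimension (the CBL projection convolutions are omitted here)."""
    p1 = maxpool2d_same(x, k)
    p2 = maxpool2d_same(p1, k)
    p3 = maxpool2d_same(p2, k)
    return np.concatenate([x, p1, p2, p3], axis=-1)
```

Cascading one small kernel three times covers the same receptive fields as the parallel multi-kernel SPP layer while reusing intermediate results.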
Illustratively, fig. 2 shows the connection between the Swin-Transformer V2 structure and the SPPF structure, where LN (Layer Normalization) denotes layer normalization, MLP (Multi-Layer Perceptron) denotes a multilayer perceptron, "+" denotes feature fusion, and Concat denotes channel-dimension stacking.
S104, fusing different network layer feature information to extract target features based on feature maps with different resolutions and feature maps with enhanced features;
and fusing the feature maps with different scales output in the steps of S102 and S103 to extract the features of different regions of interest.
Specifically, feature maps at different stages are fused top-down and bottom-up through the bidirectional feature pyramid module BiFPN, and attention regions of dense objects in the image are captured by adding CBAM (Convolutional Block Attention Module) spatial and channel attention modules.
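The channel-attention half of CBAM can be sketched as follows; the MLP weights here are illustrative placeholders, and the spatial-attention branch and the BiFPN wiring are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Sketch of CBAM's channel-attention branch: a shared two-layer MLP is
    applied to both the average-pooled and max-pooled channel descriptors,
    summed, squashed with a sigmoid, and used to re-weight channels.
    x is (H, W, C); w1 (hidden, C) and w2 (C, hidden) are illustrative
    random weights, not learned parameters."""
    avg = x.mean(axis=(0, 1))                      # (C,) average-pooled descriptor
    mx = x.max(axis=(0, 1))                        # (C,) max-pooled descriptor
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)   # shared MLP with ReLU
    scale = sigmoid(mlp(avg) + mlp(mx))            # (C,) weights in (0, 1)
    return x * scale                               # re-weight each channel
```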
And S105, detecting and outputting the target position and the category information according to the extracted target characteristics.
Based on the detected and extracted features, after judging that they match the target features, the target's position in the image and its classification are output. Specifically, the position, category, and other information of small targets can be detected on the large-size feature map, and of conventional targets at the original image size.
The actual effect of the small target detection method provided by this embodiment versus detection with the YOLOv5 model is shown in fig. 3, where (a) is the detection result of the method provided by this embodiment and (b) is the detection result of the YOLOv5 model.
In this embodiment, at the model's output prediction stage, the feature size of a small target is further reduced on the prediction feature layer after feature extraction and enhancement. When the configured anchor box is too large, small-target samples are easily filtered out because the IoU between the prediction box and the ground-truth box is too small. To address this, a dedicated small-target detection head is added during model prediction, and the anchor box size is redesigned to fit small targets, avoiding the missed detections that arise when no feature layer predicts small targets.
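The filtering effect described above is easy to see with a plain IoU computation: an oversized box around a small ground-truth box yields a tiny IoU, so the sample falls below a typical matching threshold. The box sizes and threshold below are illustrative, not values from the patent.

```python
def iou(box_a, box_b):
    """IoU between two axis-aligned boxes given as (x1, y1, x2, y2).
    A prediction whose IoU with the ground truth falls below the matching
    threshold is filtered out, which is how oversized anchors drop
    small-target samples from training."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

For example, an 8×8 ground-truth box fully inside a 64×64 box gives IoU 64/4096 ≈ 0.016, far below a 0.25 matching threshold, so the small target would be discarded unless a smaller anchor exists.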
In the small-target feature extraction stage, as the receptive field grows, the model's feature map size keeps shrinking; once the convolution stride exceeds the size of a small target, the small target's features can hardly propagate forward through the neural network. To address this, a Swin-Transformer V2 structure and an SPPF structure are added to the model's feature extraction backbone, and by increasing the number of attention heads and enlarging the window size, finer and tinier target feature information is extracted, reducing missed small-target features.
In another embodiment of the invention, a target-size-weighted loss function is defined: the prediction-head loss of a predetermined target is emphasized by increasing that target's weight value.
The MLP prediction head is a 3-layer perceptron with ReLU activation and d hidden nodes. Each object query predicts a bounding box and object class through the prediction head, where the bounding box has three values: the object's center point, width, and height.
Illustratively, a target size weighted loss function is designed, and the formula is as follows:
Loss = λ1·L_cls + λ2·L_obj + λ3·L_loc        (1)
Loss_total = W1·Loss1 + W2·Loss2 + W3·Loss3 + W4·Loss4        (2)
Here, Loss is the loss per detection head, comprising the classification loss L_cls, confidence loss L_obj, and regression loss L_loc, weighted by the coefficients λ1, λ2, and λ3 respectively; Loss_total is the total loss over targets of different scale classes, where Loss1, Loss2, Loss3, and Loss4 are the detection-head losses for small targets and for conventional small, medium, and large targets respectively. To raise small-target detection accuracy, the small-target head loss Loss1 is emphasized by increasing its weight W1.
Designing loss functions with different weights and increasing the loss weight of the small-target layer mitigates both the small contribution of small-target samples to the model and the loss of back-propagation gradient.
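Equations (1) and (2) can be sketched directly. The λ and W values below are illustrative placeholders, since the patent does not fix their numeric values here.

```python
def head_loss(l_cls, l_obj, l_loc, lambdas=(0.5, 1.0, 0.05)):
    """Per-head loss, Eq. (1): weighted sum of classification,
    confidence, and regression terms. The lambda coefficients are
    illustrative placeholders, not the patent's actual values."""
    l1, l2, l3 = lambdas
    return l1 * l_cls + l2 * l_obj + l3 * l_loc

def total_loss(head_losses, weights):
    """Total loss, Eq. (2): head losses weighted per target scale.
    Raising the first weight (the small-target head, W1) boosts the
    gradient contribution of small-target samples."""
    assert len(head_losses) == len(weights)
    return sum(w * l for w, l in zip(weights, head_losses))
```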
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 4 is a schematic structural diagram of a small target detection system based on an unmanned aerial vehicle image according to an embodiment of the present invention, where the system includes:
the preprocessing module 410 is used for receiving the images acquired by the unmanned aerial vehicle, preprocessing the images, and calculating anchor values of targets with different sizes in the images through a clustering algorithm;
wherein the preprocessing module 410 comprises:
a scaling processing unit for scaling the image within a predetermined section;
and the data enhancement unit is used for performing mosaic data enhancement processing on the zoomed image.
The feature extraction module 420 is used for extracting feature maps with different resolutions through a CSP-Darknet53 backbone feature extraction model;
the feature enhancement module 430 is configured to perform feature enhancement on the feature map based on a swin-Transformer structure and an SPPF structure;
in the SPPF structure, input feature maps are connected in parallel and cascaded, stacking processing of channel dimensions is carried out, and feature map features are extracted through a CBL module.
The enhanced feature extraction module 440 is configured to perform target feature extraction by fusing different network layer feature information based on feature maps with different resolutions and feature maps after feature enhancement;
specifically, feature maps in different stages are connected and fused up and down through a bidirectional feature pyramid BiFPN, and attention areas of dense objects in the images are extracted through a CBAM space and a channel attention module.
And the target prediction module 450 is configured to detect and output target position and category information according to the extracted target features.
In some embodiments, a loss function weighted by the size of the target is defined, the predicted head loss of the predetermined target is increased, and the weight value of the predetermined target is increased, thereby improving the detection accuracy of the small target.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and the modules described above may refer to corresponding processes in the foregoing method embodiments, and are not described herein again.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device is used to detect small targets in unmanned aerial vehicle images. As shown in fig. 5, the electronic device 5 of this embodiment includes at least: a memory 510, a processor 520, and a system bus 530, the memory 510 storing an executable program 5101. Those skilled in the art will understand that the structure shown in fig. 5 does not limit the electronic device, which may include more or fewer components than shown, combine some components, or arrange components differently.
The following describes each component of the electronic device in detail with reference to fig. 5:
the memory 510 may be used to store software programs and modules, and the processor 520 may execute various functional applications and data processing of the electronic device by operating the software programs and modules stored in the memory 510. The memory 510 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as cache data) created according to the use of the electronic device, and the like. Further, the memory 510 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The memory 510 stores an executable program 5101. The executable program 5101 may be divided into one or more modules/units, which are stored in the memory 510 and executed by the processor 520 to implement the predetermined small target detection and the like. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, used to describe the execution process of the computer program 5101 in the electronic device 5. For example, the computer program 5101 may be partitioned into a preprocessing module, a feature extraction module, a feature enhancement module, an enhanced feature extraction module, a target prediction module, and the like.
The processor 520 is a control center of the electronic device, connects various parts of the whole electronic device using various interfaces and lines, performs various functions of the electronic device and processes data by operating or executing software programs and/or modules stored in the memory 510 and calling data stored in the memory 510, thereby performing overall status monitoring of the electronic device. Alternatively, processor 520 may include one or more processing units; preferably, the processor 520 may integrate an application processor, which mainly handles operating systems, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 520.
The system bus 530 connects the various functional units inside the computer and can transmit data, address, and control information; it may be, for example, a PCI bus, an ISA bus, or a CAN bus. The instructions of the processor 520 are transferred to the memory 510 through the bus, the memory 510 feeds data back to the processor 520, and the system bus 530 is responsible for data and instruction interaction between the processor 520 and the memory 510. Of course, the system bus 530 may also connect other devices, such as network interfaces and display devices.
In this embodiment of the present invention, the executable program executed by the processor 520 included in the electronic device comprises:
acquiring an image acquired by an unmanned aerial vehicle, preprocessing the image, and calculating anchor values of targets with different sizes in the image through a clustering algorithm;
extracting feature maps with different resolutions in the image through a CSP-Darknet53 backbone feature extraction model;
performing feature reinforcement on the feature map based on the swin-Transformer structure and the SPPF structure;
fusing different network layer feature information to extract target features based on feature maps with different resolutions and feature maps with enhanced features;
and detecting and outputting the position and the category information of the target according to the extracted target characteristics.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A small target detection method based on unmanned aerial vehicle images is characterized by comprising the following steps:
acquiring an image acquired by an unmanned aerial vehicle, preprocessing the image, and calculating anchor values of targets with different sizes in the image through a clustering algorithm;
extracting feature maps with different resolutions through a CSP-Darknet53 backbone feature extraction model;
performing feature reinforcement on the feature map based on the swin-Transformer structure and the SPPF structure;
fusing different network layer feature information to extract target features based on feature maps with different resolutions and feature maps with enhanced features;
and detecting and outputting the position and the category information of the target according to the extracted target characteristics.
2. The method of claim 1, wherein preprocessing the image comprises:
and carrying out zooming processing on the image in a preset section, and carrying out mosaic data enhancement processing on the zoomed image.
3. The method of claim 1, wherein the SPPF structure maximally pools the input feature maps, stacks the pooled feature maps, and extracts feature map features through a CBL module.
4. The method according to claim 1, wherein fusing feature information of different network layers for target feature extraction based on feature maps with different resolutions and feature maps after feature enhancement comprises:
and feature maps in different stages are vertically connected and fused through a bidirectional feature pyramid BiFPN, and attention areas of dense objects in the images are extracted through a CBAM (cubic boron nitride) space and channel attention module.
5. The method of claim 1, wherein detecting and outputting the position and category information of the target according to the extracted target features further comprises:
defining a loss function weighted by target size, which increases the prediction-head loss of the predetermined targets and raises their weight values.
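Claim 5's size-weighted loss can be sketched as below. The patent does not disclose an exact formula, so the weighting function here — boosting a box's loss as its relative area shrinks — and the `alpha` strength parameter are illustrative assumptions.

```python
import math

def size_weight(w, h, img_w=640, img_h=640, alpha=2.0):
    """Return a loss weight for one box: ~1 for image-sized targets,
    approaching 1 + alpha for vanishingly small ones."""
    rel_area = (w * h) / (img_w * img_h)
    return 1.0 + alpha * (1.0 - math.sqrt(rel_area))

def weighted_head_loss(base_losses, boxes):
    """Scale each box's base prediction-head loss by its size weight,
    so small targets contribute more to the total."""
    return sum(loss * size_weight(w, h)
               for loss, (w, h) in zip(base_losses, boxes))

# Two boxes with equal base loss: the 16x16 box dominates the total.
total = weighted_head_loss([1.0, 1.0], [(16, 16), (640, 640)])
```

Because the weight is monotone in relative area, a 16×16 box in a 640×640 image receives nearly triple the weight of an image-sized box, which is the intended bias toward small targets.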
6. A small target detection system based on unmanned aerial vehicle images, characterized by comprising:
a preprocessing module, configured to receive an image captured by an unmanned aerial vehicle, preprocess the image, and calculate anchor values for targets of different sizes in the image through a clustering algorithm;
a feature extraction module, configured to extract feature maps of different resolutions through a CSP-Darknet53 backbone feature extraction model;
a feature enhancement module, configured to perform feature enhancement on the feature maps based on the Swin Transformer structure and the SPPF structure;
an enhanced feature extraction module, configured to fuse feature information from different network layers to extract target features, based on the feature maps of different resolutions and the feature-enhanced feature maps;
and a target prediction module, configured to detect and output the position and category information of the target according to the extracted target features.
7. The system of claim 6, wherein the preprocessing module comprises:
a scaling unit, configured to scale the image to within a preset range;
and a data augmentation unit, configured to perform mosaic data augmentation on the scaled image.
8. The system of claim 6, wherein the SPPF structure performs max pooling on the input feature maps, stacks the pooled feature maps, and extracts features from the stacked maps through the CBL module.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the small target detection method based on unmanned aerial vehicle images according to any one of claims 1 to 5.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed, implements the steps of the small target detection method based on unmanned aerial vehicle images according to any one of claims 1 to 5.
CN202210881394.0A 2022-07-20 2022-07-20 Small target detection method and system based on unmanned aerial vehicle image Withdrawn CN115410100A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210881394.0A CN115410100A (en) 2022-07-20 2022-07-20 Small target detection method and system based on unmanned aerial vehicle image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210881394.0A CN115410100A (en) 2022-07-20 2022-07-20 Small target detection method and system based on unmanned aerial vehicle image

Publications (1)

Publication Number Publication Date
CN115410100A true CN115410100A (en) 2022-11-29

Family

ID=84158478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210881394.0A Withdrawn CN115410100A (en) 2022-07-20 2022-07-20 Small target detection method and system based on unmanned aerial vehicle image

Country Status (1)

Country Link
CN (1) CN115410100A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761552A (en) * 2023-01-08 2023-03-07 石家庄铁道大学 Target detection method, system, equipment and medium for airborne platform of unmanned aerial vehicle
CN115761552B (en) * 2023-01-08 2023-05-26 石家庄铁道大学 Target detection method, device and medium for unmanned aerial vehicle carrying platform
CN117237746A (en) * 2023-11-13 2023-12-15 光宇锦业(武汉)智能科技有限公司 Small target detection method, system and storage medium based on multi-intersection edge fusion
CN117237746B (en) * 2023-11-13 2024-03-15 光宇锦业(武汉)智能科技有限公司 Small target detection method, system and storage medium based on multi-intersection edge fusion
CN117876942A (en) * 2024-03-12 2024-04-12 中国民用航空飞行学院 Unmanned aerial vehicle and bird monitoring method based on convolutional neural network

Similar Documents

Publication Publication Date Title
CN109284670B (en) Pedestrian detection method and device based on multi-scale attention mechanism
CN112380952B (en) Power equipment infrared image real-time detection and identification method based on artificial intelligence
CN115410100A (en) Small target detection method and system based on unmanned aerial vehicle image
KR102661954B1 (en) A method of processing an image, and apparatuses performing the same
CN111461106B (en) Object detection method and device based on reconfigurable network
CN110378297B (en) Remote sensing image target detection method and device based on deep learning and storage medium
CN111951212A (en) Method for identifying defects of contact network image of railway
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
CN114863368A (en) Multi-scale target detection model and method for road damage detection
CN111462050B (en) YOLOv3 improved minimum remote sensing image target detection method and device and storage medium
CN113808098A (en) Road disease identification method and device, electronic equipment and readable storage medium
CN115375999B (en) Target detection model, method and device applied to hazardous chemical vehicle detection
CN112465854A (en) Unmanned aerial vehicle tracking method based on anchor-free detection algorithm
CN115690545A (en) Training target tracking model and target tracking method and device
Gopal et al. Tiny object detection: Comparative study using single stage CNN object detectors
CN112232236A (en) Pedestrian flow monitoring method and system, computer equipment and storage medium
CN116597411A (en) Method and system for identifying traffic sign by unmanned vehicle in extreme weather
EP4332910A1 (en) Behavior detection method, electronic device, and computer readable storage medium
CN115393755A (en) Visual target tracking method, device, equipment and storage medium
CN115063831A (en) High-performance pedestrian retrieval and re-identification method and device
CN114927236A (en) Detection method and system for multiple target images
CN115205855A (en) Vehicle target identification method, device and equipment fusing multi-scale semantic information
CN114267030A (en) End-to-end-based license plate detection and key point detection method and device
CN114120208A (en) Flame detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20221129