CN114943903B - Self-adaptive clustering target detection method for aerial image of unmanned aerial vehicle - Google Patents

Self-adaptive clustering target detection method for aerial image of unmanned aerial vehicle

Info

Publication number
CN114943903B
CN114943903B (application CN202210572768.0A)
Authority
CN
China
Prior art keywords
network
image
detection
target
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210572768.0A
Other languages
Chinese (zh)
Other versions
CN114943903A (en)
Inventor
李云
王学军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Finance and Economics filed Critical Guangxi University of Finance and Economics
Priority to CN202210572768.0A priority Critical patent/CN114943903B/en
Publication of CN114943903A publication Critical patent/CN114943903A/en
Application granted granted Critical
Publication of CN114943903B publication Critical patent/CN114943903B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/17 Terrestrial scenes taken from planes or by drones
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/766 Arrangements for image or video recognition or understanding using regression, e.g. by projecting features on hyperplanes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Remote Sensing (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a self-adaptive clustering target detection method for aerial images of an unmanned aerial vehicle, which comprises the following steps: segmenting an image based on a self-adaptive clustering network, and correcting the segmented images; constructing a local detection network and a global detection network; detecting the corrected images based on the local detection network; detecting, based on the global detection network, targets missed by the local detection network; and fusing the detection results of the local detection network and the global detection network to obtain the self-adaptive clustering target detection result for the aerial image of the unmanned aerial vehicle. The method improves the accuracy of target detection algorithms for aerial images of unmanned aerial vehicles, and in particular greatly improves the detection accuracy for small targets.

Description

Self-adaptive clustering target detection method for aerial image of unmanned aerial vehicle
Technical Field
The invention belongs to the field of target detection of aerial images, and particularly relates to a self-adaptive clustering target detection method for aerial images of unmanned aerial vehicles.
Background
In recent years, with the iteration of electronic communication technology and the continued maturation of machine intelligence, the unmanned aerial vehicle industry has entered a stage of rapid development, and unmanned aerial vehicles are now deeply applied in fields such as power-line inspection, environmental protection, biological monitoring, logistics and transportation, post-disaster rescue, data acquisition and mobile communication. In the coming years, the deep fusion of unmanned aerial vehicle technology, artificial intelligence and new-generation communication technology will further solve problems that unmanned aerial vehicles face in current industrial production, and will also promote their adoption in new fields. The wide application of unmanned aerial vehicles has greatly improved production efficiency and greatly reduced the consumption of manpower, material and financial resources, and unmanned aerial vehicles play an increasingly important role in today's society.
At present, target detection (object detection) algorithms based on deep learning can maintain high detection performance: for common scenes, such as those with a relatively simple background, a small number of targets, large target sizes and a horizontal shooting angle, classical target detection algorithms can maintain high detection accuracy while guaranteeing detection speed. Beyond common shooting scenes, target detection in aerial images has become a research hotspot both at home and abroad. Aerial images can be divided into satellite images captured by satellites and unmanned aerial vehicle images captured by drones; satellite images are characterized by large size, a fixed shooting angle and targets that occupy a small fraction of the image. Because unmanned aerial vehicle images are constrained by factors such as the shooting equipment and environment, they are more complex and richer in content. Compared with satellite images, unmanned aerial vehicle images are more widely used in civil and military applications. Deeply mining the latent information of unmanned aerial vehicle images is therefore of great significance for deeper applications of unmanned aerial vehicles across society.
Disclosure of Invention
In order to solve the technical problems, the invention provides a self-adaptive clustering target detection method for an aerial image of an unmanned aerial vehicle, which is used for improving the precision of a target detection algorithm of the aerial image of the unmanned aerial vehicle, and particularly greatly improving the detection precision of small targets.
In order to achieve the purpose, the invention provides a self-adaptive clustering target detection method for an aerial image of an unmanned aerial vehicle, which comprises the following steps:
segmenting the aerial image of the unmanned aerial vehicle based on the self-adaptive clustering network to obtain an image to be detected;
correcting the image to be detected;
constructing a local detection model and a global detection model;
detecting the corrected image to be detected based on the local detection model;
detecting a detection target missed by the local detection model based on the global detection model;
and fusing the detection result of the local detection model and the detection result of the global detection model to obtain the self-adaptive clustering target detection result of the aerial image of the unmanned aerial vehicle.
Optionally, the adaptive clustering network includes: a feature extraction sub-network, a clustering region suggestion sub-network and a clustering region modification sub-network.
Optionally, segmenting the drone aerial image comprises:
positioning the position of a target in the aerial image of the unmanned aerial vehicle to obtain a positioning area;
carrying out target identification on the positioning area to obtain an aggregation target potential region;
and performing regression prediction on the aggregation target potential region, and segmenting the aggregation target potential region subjected to regression prediction to obtain the image to be detected.
Optionally, the correcting the image to be detected includes:
detecting the size of the image to be detected, uniformly dividing the image to be detected whose size exceeds a first preset value, and padding the image to be detected whose aspect ratio is unbalanced.
Optionally, an attention mechanism, a variable threshold and a sample balancing strategy are introduced into the local detection model.
Optionally, the local detection model includes: a clustering image input sub-network, a channel attention and space attention sub-network and a detection sub-network;
the clustered image input sub-network is used for extracting features from the corrected image to be detected, the channel attention and space attention sub-network is used for performing category fusion on the feature maps to obtain a feature map to be detected, and the detection sub-network is used for detecting the feature map to be detected and outputting a local detection result.
Optionally, the attention mechanism comprises: a channel attention mechanism and a space attention mechanism;
the attention mechanism comprises the following operation processes: firstly, performing global maximum pooling and global average pooling on feature maps of all channels, then performing compression and activation by using a compression-activation module, and finally outputting by using a sigmoid activation function.
Optionally, the sample balancing policy is:
s1, counting the number proportion and the area proportion of various targets in an aerial image data set of the unmanned aerial vehicle;
s2, judging the number ratio and the area ratio to determine the categories to be amplified;
s3, selecting any one corrected image to be detected, skipping the image when the total target number of the amplification categories in the corrected image to be detected is smaller than a second preset value, and classifying the targets needing to be amplified by using a mean shift clustering algorithm when the total target number is larger than or equal to the second preset value to obtain a classification result;
s4, respectively using boundary targets contained in each category in the classification result as boundaries, framing the area, stretching the area in an equal ratio to the size of the area of the original image, generating a new labeling file, judging the ratio of the intercepted target area to the original target area, and labeling based on the judgment result;
and S5, repeating the step S3 until all the pictures in the aerial image data set of the unmanned aerial vehicle are traversed, and obtaining an expanded training set.
Optionally, the variable threshold is:
TH = f(TH_low, TH_high, U, I)   [the exact expression is given as an image in the original publication]
wherein TH is the variable threshold, TH_low and TH_high are the set minimum and maximum thresholds, and U and I are the union area and the intersection area of the current prediction bounding box and the other prediction bounding boxes, respectively.
Compared with the prior art, the invention has the following advantages and technical effects:
1. The invention addresses the problem of target detection in unmanned aerial vehicle images and, in order to improve the detection accuracy of small targets in such images, provides a self-adaptive clustering target detection method for aerial images of unmanned aerial vehicles. Considering that the target regions occupy only part of an aerial image, a self-adaptive clustering algorithm is proposed to segment the image and finely detect the target aggregation regions, thereby improving both target detection accuracy and detection efficiency.
2. Two problems are considered for the pictures segmented by the self-adaptive clustering network. First, the picture size may be too large: for a detection network, small targets remain difficult to detect when an oversized picture is input. Second, the picture aspect ratio may be severely unbalanced, which reduces detection accuracy after the picture is input into the detection network. To avoid the detection differences caused by such extreme inputs, a scale estimation strategy is used to process the size of the segmented pictures, and the segmented images are corrected by division and padding.
3. To improve the detection accuracy of the model, an attention mechanism is introduced and a local detection network for the clustering regions is proposed; the segmented images and the original image are detected by the local detection network. The local detection network can be any target detection algorithm, and its backbone improves the detection accuracy of the model by introducing an attention mechanism, using non-maximum suppression (NMS) with a variable threshold, using a sample balancing strategy, and so on. Considering possible target truncation during self-adaptive clustering and the sparsely distributed targets remaining in the original image, a global detection network for large and medium targets is also trained, and its detection results supplement those of the local detection network.
4. The method tackles the difficulties of target detection in aerial images of unmanned aerial vehicles from two aspects, data enhancement and network structure improvement. The proposed self-adaptive clustering method effectively improves the detection performance of the network, helps improve the accuracy of target detection algorithms for aerial images of unmanned aerial vehicles, and in particular greatly improves the detection accuracy for small targets.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
fig. 1 is a schematic diagram of a method for detecting a self-adaptive clustering target for an aerial image of an unmanned aerial vehicle according to embodiment 1 of the present invention;
fig. 2 is a schematic diagram of a feature extraction network structure according to embodiment 1 of the present invention;
fig. 3 is a schematic diagram of 9 candidate regions generated by the network according to embodiment 1 of the present invention;
FIG. 4 is a schematic Gaussian function chart of example 1 of the present invention;
fig. 5 is a schematic diagram of a cluster region correction network structure according to embodiment 1 of the present invention;
FIG. 6 is a schematic comparison of cluster region sizes before and after filling in example 1 of the present invention;
fig. 7 is a schematic diagram of a local detection network structure according to embodiment 1 of the present invention;
FIG. 8 is a schematic diagram of a channel attention network structure according to embodiment 1 of the present invention;
fig. 9 is a schematic structural diagram of a spatial attention network according to embodiment 1 of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
Example 1
As shown in fig. 1, the present embodiment provides a method for detecting an adaptive clustering target for an aerial image of an unmanned aerial vehicle, including:
segmenting an image based on a self-adaptive clustering network, and correcting the segmented image;
constructing a local detection network and a global detection network;
detecting the corrected image based on the local detection network;
detecting, based on the global detection network, targets missed by the local detection network;
and fusing the detection result of the local detection network and the detection result of the global detection network to obtain a self-adaptive clustering target detection result of the aerial image of the unmanned aerial vehicle.
Furthermore, the invention provides a self-adaptive clustering target detection method for aerial images of unmanned aerial vehicles, aiming at improving the detection precision of small targets in the aerial images of unmanned aerial vehicles. The target detection area of the aerial image only occupies a part of the whole image, so that the image is segmented by using a self-adaptive clustering algorithm, and the target gathering area is subjected to fine detection, thereby improving the target detection precision and the detection efficiency.
The process of segmenting the image by the self-adaptive clustering network is roughly divided into the following steps: roughly positioning a target position; and carrying out accurate identification on the target in the positioning area. After the network is trained and optimized, regression prediction is carried out on the aggregation target potential areas in the image, then the areas are segmented, and the segmented image is sent to a subsequent detection network for detection.
In the embodiment, the image is segmented by a self-adaptive clustering network, which is mainly composed of a feature extraction sub-network, a clustering region suggestion sub-network and a clustering region correction sub-network.
The feature extraction network adopts an improved DetNet59 as the backbone; the improved network combines a feature pyramid network (FPN) with a channel attention mechanism. The network structure is shown in fig. 2. The C4, C5 and C6 layers do not use downsampling, so the feature maps are kept at the high resolution of "16x"; high-resolution feature maps preserve more small targets and benefit the regression of target positions. However, since omitting downsampling reduces the receptive field, the network enlarges the receptive field of the feature maps by introducing dilated convolution, and the number of channels of the feature map output by each layer is fixed to 256 to reduce the number of network parameters. A feature pyramid network is then used to fuse the feature maps output by different layers: the fused P2 and P3 feature maps retain more large-target features, while the P4, P5 and P6 feature maps retain more small-target features. Finally, a channel attention network is cascaded after each output layer of the feature pyramid, and all feature maps are output in parallel.
In fig. 2, "C1" denotes the output of the first convolution block, "4x" denotes that the output resolution of the feature map is one quarter of the original picture resolution, "P2" denotes the result of fusing "C2" with the feature map of the layer above, and so on.
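As an illustration of the resolution-preserving design described above, the following PyTorch sketch shows a DetNet-style bottleneck block that keeps the feature map at the same resolution and enlarges the receptive field with a dilated 3×3 convolution; the module name, the 256-channel width and the dilation rate of 2 are illustrative assumptions, not the exact configuration of the patented backbone.

```python
import torch
import torch.nn as nn

class DilatedBottleneck(nn.Module):
    """DetNet-style bottleneck: keeps the spatial resolution (stride 1) and
    enlarges the receptive field with a dilated 3x3 convolution."""
    def __init__(self, channels=256, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=dilation,
                      dilation=dilation, bias=False),   # no stride, so no downsampling
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # residual connection, same H x W as input

x = torch.randn(1, 256, 64, 64)              # a "16x" feature map of a 1024x1024 image
print(DilatedBottleneck()(x).shape)          # torch.Size([1, 256, 64, 64])
```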
The clustering region suggestion network draws on the idea of the RPN in Faster R-CNN: whereas the RPN regresses and classifies individual targets, the clustering region suggestion network regresses and classifies clustering regions. Starting from the 256 M×N feature maps output by the feature extraction stage, the receptive field is further enlarged using a 3×3 convolution kernel with a dilation rate of 2, and the output remains 256×M×N. These feature maps are regarded as M×N 256-dimensional vectors, each of which corresponds to 9 candidate regions in the original image; as shown in fig. 3, the 9 regions are defined by three sizes and three aspect ratios (1:1, 1:2 and 2:1), giving M×N×9 regions in total. Binary classification and regression are performed on these regions to obtain the prediction information of the clustering regions. Specifically, the 256 M×N feature maps are processed by two branches: one branch performs binary classification of the regions to separate foreground from background, using a 1×1 convolution kernel to output two confidence scores, i.e. 9×2×M×N values in total; the other branch regresses the clustering regions, using a 1×1 convolution kernel to obtain 9×4×M×N values. Finally, the regions predicted as foreground are selected and fused using the non-maximum suppression (NMS) algorithm to obtain the output result, as sketched below.
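A minimal sketch of the two-branch prediction head described above is given below (PyTorch). The dilated 3×3 convolution and the two 1×1 branches follow the text; the layer names and the omission of normalization layers are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ClusterRegionProposalHead(nn.Module):
    """For each of the M x N positions and 9 candidate regions, predict
    2 foreground/background scores and 4 box regression offsets."""
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        # 3x3 convolution with dilation 2 to further enlarge the receptive field
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=2, dilation=2)
        self.relu = nn.ReLU(inplace=True)
        self.cls_branch = nn.Conv2d(in_channels, num_anchors * 2, 1)  # 9x2xMxN scores
        self.reg_branch = nn.Conv2d(in_channels, num_anchors * 4, 1)  # 9x4xMxN offsets

    def forward(self, feat):                  # feat: [B, 256, M, N]
        feat = self.relu(self.conv(feat))
        return self.cls_branch(feat), self.reg_branch(feat)

scores, boxes = ClusterRegionProposalHead()(torch.randn(1, 256, 32, 40))
print(scores.shape, boxes.shape)              # [1, 18, 32, 40] and [1, 36, 32, 40]
```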
Automatic labeling avoids the errors introduced by manual labeling, but introduces errors of its own, so a clustering region correction network is proposed to correct them; this network determines the number of clustering regions by solving a classification task. The training procedure is as follows: the labeling information of a picture is obtained from its annotation file; given that the number of clustering regions in the picture is N, an M-dimensional vector whose probability distribution follows a one-dimensional Gaussian distribution is constructed from N, where M is the maximum number of clustering regions. M is an empirical value; in this embodiment, M = 10 is obtained by statistics over the VisDrone 2019 data set. The Gaussian function is expressed as formula (1):
f(x) = (1 / (σ·√(2π))) · exp(−(x − μ)² / (2σ²))   (1)
In formula (1), μ = N and σ = 1 are used; the function is sampled at integer positions and a probability distribution is constructed from the sampled values. For example, when N = 3, i.e. the number of clustering regions is 3, the constructed Gaussian function takes the values at integer positions shown in fig. 4, and the quantized probability distribution is:
p=[0.054 0.24 0.40 0.24 0.054 4.4e-3 1.3e-4 1.5e-6 6.1e-9 9.1e-12]
This probability distribution is normalized using the softmax function so that all values sum to 1; the normalization yields the following vector:
p true =[0.095 0.114 0.134 0.114 0.095 0.090 0.090 0.090 0.090 0.090]
the vector is used as a real label of a training clustering area quantity correction network, and a cross entropy function is used as a loss function during training.
The structure of the clustering region number correction network is shown in fig. 5: the feature map output by the feature extraction stage is convolved with a 3×3 dilated convolution with a dilation rate of 3, its dimensionality is then reduced with a 1×1 convolution kernel to obtain 10 M×N feature maps, global maximum pooling is applied, and finally a fully connected layer is attached and the result is output through a softmax function. The number of clusters is determined from the output of the clustering region number correction network, and, combined with the output of the clustering region suggestion network, the output result of the self-adaptive clustering network is determined.
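A minimal PyTorch sketch of this correction head follows, under the assumption that the fully connected layer maps the 10 pooled values directly to the M = 10 output classes; layer names are illustrative.

```python
import torch
import torch.nn as nn

class ClusterCountCorrectionHead(nn.Module):
    """Predicts a distribution over the possible numbers of clustering regions (M classes)."""
    def __init__(self, in_channels=256, M=10):
        super().__init__()
        self.dilated = nn.Conv2d(in_channels, in_channels, 3, padding=3, dilation=3)
        self.reduce = nn.Conv2d(in_channels, M, 1)   # 1x1 conv: 256 -> 10 channels
        self.fc = nn.Linear(M, M)                    # fully connected layer
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feat):                         # feat: [B, 256, M, N]
        x = self.relu(self.dilated(feat))
        x = self.relu(self.reduce(x))
        x = torch.amax(x, dim=(2, 3))                # global maximum pooling -> [B, 10]
        return torch.softmax(self.fc(x), dim=1)      # probability over cluster counts

probs = ClusterCountCorrectionHead()(torch.randn(1, 256, 32, 40))
print(probs.shape, float(probs.sum()))               # torch.Size([1, 10]) 1.0
```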
Further, the segmented images are corrected: images whose size exceeds a preset value are uniformly divided, images with an unbalanced aspect ratio are padded, and images with a small size are padded.
In this embodiment, the pictures segmented by the self-adaptive clustering network exhibit two problems. First, the picture size may be too large: for a detection network, small targets remain difficult to detect when an oversized picture is input. Second, the picture aspect ratio may be unbalanced, which reduces detection accuracy after the picture is input into the detection network. To avoid the detection differences caused by such extreme inputs, this embodiment uses a scale estimation strategy to process the size of the segmented pictures, i.e. the segmented images are corrected by division and padding (a sketch is given after the steps below). The scale estimation strategy consists of the following steps:
(1) Thresholds Pm and Ps are determined according to the required input scale of the detector, Pm representing the maximum input size and Ps the minimum input size. Both Pm and Ps are hyper-parameters.
(2) Let the length and width of the input picture be w and h, respectively. When the size w of the long edge of the picture is larger than Pm, uniformly dividing the picture; when the short side size h of the picture is smaller than Ps, calculating the required filling proportion, wherein the calculation formula is as follows (2):
Ti = (Ps − h) / h   (2)
Ti is the filling ratio, i.e. padding is applied to the divided picture so that the padded size is (1 + Ti) times the original; fig. 6 compares a cluster region before and after filling.
(3) All the segmented pictures are traversed and operation (2) is applied to obtain the processed pictures.
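A simplified sketch of the scale estimation strategy is given below. The padding ratio Ti = (Ps − h)/h is an assumption consistent with the statement that the padded size is (1 + Ti) times the original, and halving along the long side is an assumed form of the uniform division.

```python
import numpy as np

def correct_cluster_region(img, Pm=1024, Ps=512):
    """Split over-sized regions and pad regions whose short side is too small.
    img: H x W x 3 array; returns a list of corrected regions."""
    h, w = img.shape[:2]
    long_side, short_side = max(h, w), min(h, w)
    if long_side > Pm:                       # step (2a): uniformly divide over-sized pictures
        if w >= h:
            return [img[:, : w // 2], img[:, w // 2:]]
        return [img[: h // 2], img[h // 2:]]
    if short_side < Ps:                      # step (2b): pad pictures with an unbalanced ratio
        Ti = (Ps - short_side) / short_side  # assumed filling ratio: (1 + Ti) * short = Ps
        pad = int(round(Ti * short_side))
        if h <= w:
            img = np.pad(img, ((0, pad), (0, 0), (0, 0)))
        else:
            img = np.pad(img, ((0, 0), (0, pad), (0, 0)))
    return [img]

regions = correct_cluster_region(np.zeros((300, 900, 3), dtype=np.uint8))
print([r.shape for r in regions])            # [(512, 900, 3)]
```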
Further, the detection networks are constructed: a local detection network and a global detection network. The segmented images and the original image are detected by these networks.
In this embodiment, the detection network may be any target detection algorithm; the backbone of the local detection network improves the detection accuracy of the model by introducing an attention mechanism, using an NMS with a variable threshold, using a sample balancing strategy, and so on.
The structure of the local detection network is shown in fig. 7. First, the clustered image input network performs feature extraction, and the extracted feature maps are fed into a channel attention network and a spatial attention network, respectively; the FPN then passes the fused feature maps to the detection network, and at the output of the detection network the detection results are fused with the variable-threshold NMS. The detector can be any target detection network, and the training set is enhanced with the sample balancing strategy before training.
Considering that small targets in the clustering regions are extremely difficult to detect, a dual-attention backbone network is formed by introducing a channel attention mechanism and a spatial attention mechanism, which reduces the dispersion of small-target features on the feature map and improves the detection accuracy of small targets.
The feature extraction network of the detection network still uses the improved DetNet59, followed by the dual attention network and then cascaded with the FPN outputs. The implementation of the channel attention module and the spatial attention module is described below.
The overall operation of the channel attention mechanism is as follows: the feature maps of all channels are processed by global maximum pooling and global average pooling, a compression-activation (squeeze-and-excitation) module is then applied, and the output is finally produced by a sigmoid activation function; the network structure of the channel attention mechanism is shown in fig. 8. The specific steps, sketched in code after the list, are:
(1) Assume the dimensions of the input feature map are {H, W, C}, where H is the height, W the width and C the number of channels of the feature map. After the two branches of global maximum pooling and global average pooling, the feature map of each channel is compressed to a single value, i.e. the dimension of the output feature map is {1, C};
(2) The output feature maps are fed into a fully connected layer, and the two feature maps share the parameters of this layer; the parameters are reduced in the fully connected layer by compression and activation, with the compression rate set to r, so the dimension of the feature map output by this operation is {1, C/r};
(3) After the fully connected layer, the output feature maps have acquired the importance of the different channels; the two branches are then added and fused into one feature map, and finally a sigmoid activation function produces the output.
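A compact PyTorch sketch of the channel attention module in steps (1)-(3) follows; the compression rate r and the use of 1×1 convolutions as the shared fully connected layers are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels=256, r=16):
        super().__init__()
        # shared squeeze-and-excitation MLP applied to both pooled branches
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1, bias=False),  # compress: C -> C/r
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1, bias=False),  # expand back: C/r -> C
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                        # x: [B, C, H, W]
        max_branch = self.fc(torch.amax(x, dim=(2, 3), keepdim=True))  # global max pooling
        avg_branch = self.fc(torch.mean(x, dim=(2, 3), keepdim=True))  # global average pooling
        weights = self.sigmoid(max_branch + avg_branch)          # per-channel importance in (0, 1)
        return x * weights

x = torch.randn(2, 256, 32, 32)
print(ChannelAttention()(x).shape)                               # torch.Size([2, 256, 32, 32])
```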
The structure of the spatial attention network is shown in fig. 9, and the dimensions of the feature map are still {H, W, C}. The spatial attention module extracts spatial-region features from the feature map using global maximum pooling and global average pooling, outputs two feature maps, fuses them with a convolution operation, and finally outputs the feature map through a sigmoid activation function. The specific steps, sketched in code after the list, are:
(1) Regard the input feature map as H × W C-dimensional feature vectors, and apply global maximum pooling and global average pooling over the H × W feature vectors to obtain two feature maps of dimension {H, W, 1};
(2) Concatenate the two feature maps into a feature map of dimension {H, W, 2}, and convolve it with a 7 × 7 convolution kernel to obtain a feature map of dimension {H, W, 1};
(3) Finally, output the feature map through a sigmoid activation function; this feature map has acquired the importance of the different spatial regions.
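Similarly, a minimal sketch of the spatial attention module in steps (1)-(3), assuming channel-wise max and mean maps followed by a 7×7 convolution:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                # x: [B, C, H, W]
        max_map, _ = torch.max(x, dim=1, keepdim=True)   # {H, W, 1}: max over channels
        avg_map = torch.mean(x, dim=1, keepdim=True)     # {H, W, 1}: mean over channels
        attn = torch.cat([max_map, avg_map], dim=1)      # {H, W, 2}
        attn = self.sigmoid(self.conv(attn))             # {H, W, 1}: spatial importance
        return x * attn

x = torch.randn(2, 256, 32, 32)
print(SpatialAttention()(x).shape)                       # torch.Size([2, 256, 32, 32])
```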
The non-maximum suppression (NMS) algorithm, or an algorithm improved upon NMS, is used in the final stage of the detection process: after all prediction bounding boxes have been regressed, the predictions of each target category in the picture are fused with the NMS algorithm to avoid multiple prediction boxes for the same target. The basic steps of the NMS algorithm are as follows:
(1) Divide all prediction bounding boxes into different sets by category, and sort the boxes in each set in descending order of score;
(2) Select the prediction bounding box with the highest score from the set, compute the IoU between this box and every other prediction bounding box in the set, remove any box whose IoU exceeds the set threshold TH, and keep the selected box;
(3) Repeat operation (2) on the remaining prediction bounding boxes in the set until all boxes in the set have been screened;
(4) Repeat operations (2) and (3) on each set until all sets have been screened.
The above is the standard NMS algorithm; its suppression function is expressed as formula (3):
s_i = s_i, if IoU(M, b_i) < TH;   s_i = 0, if IoU(M, b_i) ≥ TH   (3)
where M denotes the prediction bounding box with the highest current score, b_i denotes the other prediction bounding boxes, s_i denotes their scores, and TH is the set threshold.
The NMS algorithm works well for pictures with sparsely distributed targets. However, when multiple objects in a picture are close to each other, the NMS algorithm may mistake the prediction bounding boxes of different objects for boxes of the same object and eliminate them, which reduces the recall of the detection result. Unmanned aerial vehicle images contain a large number of target aggregation regions. To solve this problem, a variable-threshold NMS algorithm is proposed on the basis of the NMS algorithm, with the variable threshold calculated according to formula (4):
TH = f(TH_low, TH_high, U, I)   (4)   [the exact expression is given as an image in the original publication]
where TH_low and TH_high are the set minimum and maximum thresholds, and U and I are the union area and the intersection area of the current prediction bounding box and the other prediction bounding boxes, respectively. When (U − I) is large, the current prediction bounding box corresponds to a large target that is easy for the detector, the computed TH is small, and during NMS the current box has a smaller suppressing effect on the fusion of the other prediction bounding boxes; when (U − I) is small, the current prediction bounding box corresponds to a small target that is difficult for the detector, the computed TH is large, and during NMS the current box has a larger suppressing effect on the fusion of the other prediction bounding boxes.
Through the variable-threshold NMS algorithm, the threshold adapts to the size of the target, which effectively reduces the detector's miss rate for small targets and thereby improves the recall and detection accuracy of the detection results.
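The variable-threshold NMS can be sketched as follows. Since formula (4) is only given as an image, the mapping from (U − I) to TH used here is an assumed linear interpolation between TH_high and TH_low that reproduces the described behaviour (larger boxes receive a lower threshold), and the area_scale normalizer is likewise illustrative.

```python
import numpy as np

def variable_threshold_nms(boxes, scores, th_low=0.3, th_high=0.6, area_scale=96**2):
    """boxes: [N, 4] as (x1, y1, x2, y2); scores: [N]. Returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        m = order[0]
        keep.append(m)
        rest = order[1:]
        x1 = np.maximum(boxes[m, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[m, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[m, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[m, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)        # I
        area_m = (boxes[m, 2] - boxes[m, 0]) * (boxes[m, 3] - boxes[m, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        union = area_m + area_r - inter                                      # U
        # assumed threshold: TH falls from th_high to th_low as (U - I) grows
        th = th_high - (th_high - th_low) * np.clip((union - inter) / area_scale, 0, 1)
        iou = inter / union
        order = rest[iou <= th]                  # keep boxes whose IoU stays below their TH
    return keep

boxes = np.array([[0, 0, 50, 50], [5, 5, 55, 55], [200, 200, 400, 400]], float)
print(variable_threshold_nms(boxes, np.array([0.9, 0.8, 0.7])))   # [0, 2]
```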
Considering that the number of images in current aerial image data sets is small, that bias in the training samples strongly affects the detection model, and that the targets follow a long-tailed distribution, i.e. the area ratios of the various target categories in the training set differ greatly, a sample balancing strategy is proposed to amplify the area ratio and number ratio of the small-sample data, thereby reducing the impact of the long-tailed distribution on the accuracy of the detection model.
The concrete steps of the sample balancing strategy are as follows (a sketch is given after the steps):
counting the number proportion and the area proportion of various targets in the data set;
and determining the categories whose number ratio and area ratio are smaller than NC and SC, respectively, as the categories to be amplified, where NC and SC are hyper-parameters;
selecting a picture, and skipping the picture when the total target number of the categories to be amplified in the picture is less than N; and when the total target number is more than or equal to N, classifying the targets needing to be amplified by using a mean shift clustering algorithm. N is a hyper-parameter;
respectively taking the boundary targets contained in each category as boundaries, framing the region, stretching the region proportionally to the size of the original image, and generating a new annotation file; if the intercepted area of a target exceeds 50% of its original area, the target is labeled as a positive sample, otherwise it is not labeled;
and repeating step (3) until all the pictures in the data set are traversed, finally obtaining the expanded training set.
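The following sketch outlines the middle steps of the sample balancing loop under simplifying assumptions: ground-truth boxes are given as (class_id, x1, y1, x2, y2) tuples, MeanShift from scikit-learn with an illustrative fixed bandwidth stands in for the mean-shift clustering step, and only the crop-region generation is shown (the 50% relabeling rule is omitted).

```python
import numpy as np
from sklearn.cluster import MeanShift

def categories_to_amplify(dataset_stats, NC=0.05, SC=0.05):
    """Pick classes whose number ratio and area ratio are both small."""
    return {c for c, s in dataset_stats.items()
            if s["num_ratio"] < NC and s["area_ratio"] < SC}

def balance_image(boxes, aug_classes, N=3, bandwidth=100.0):
    """For one image: cluster the rare-class boxes and return one crop region
    (x1, y1, x2, y2) per cluster, to be stretched to the original image size."""
    rare = np.array([b for b in boxes if b[0] in aug_classes], dtype=float)
    if len(rare) < N:                        # too few rare targets: skip this image
        return []
    centers = np.stack([(rare[:, 1] + rare[:, 3]) / 2,
                        (rare[:, 2] + rare[:, 4]) / 2], axis=1)
    labels = MeanShift(bandwidth=bandwidth).fit_predict(centers)
    crops = []
    for k in np.unique(labels):
        members = rare[labels == k]          # boundary targets of this cluster
        crops.append((members[:, 1].min(), members[:, 2].min(),
                      members[:, 3].max(), members[:, 4].max()))
    return crops

boxes = [(3, 10, 10, 30, 30), (3, 40, 12, 60, 35), (3, 500, 500, 530, 540), (0, 0, 0, 99, 99)]
print(balance_image(boxes, aug_classes={3}))  # two crop regions for class 3
```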
Furthermore, after the self-adaptive clustering network clusters the original image, the clustering regions are detected; however, during clustering, large and medium targets are very likely to be truncated by the clustering network, and the truncated targets become difficult to detect. In addition, sparsely distributed targets still exist in the original picture, and large and medium targets still account for a large proportion of them. To solve this problem, a global detection network for large and medium targets is trained to detect the large and medium targets that may be truncated. The global detection network can adopt any classical network; the invention uses Faster R-CNN as the global detection network, with the improved DetNet59 of the invention as its backbone. Because this network only detects medium and large targets, small targets in the data set are ignored during training, and only large and medium targets are trained on. The detection results of this network supplement those of the local detection network, and the results of the local and global detection networks are fused to obtain the final detection result. Finally, the method realizes self-adaptive clustering target detection for aerial images of unmanned aerial vehicles.
Example 2
In this embodiment 2, three different data sets are used for testing, and the testing is performed on the VisDrone, DOTA, and UAVDT data sets respectively.
The VisDrone 2019 data set was collected by the AISKYEYE team at the Lab of Machine Learning and Data Mining, Tianjin University. The entire benchmark was captured by drones and includes 288 video clips with a total of 261908 frames and 10209 still images. Since 2018 the organizers have updated the data set and held corresponding computer vision challenges every year, covering target detection, target tracking and crowd counting. This embodiment uses the VisDrone 2019 data set as the target detection data set; it is annotated with 10 categories of targets (pedestrian, person, bicycle, car, van, truck, tricycle, awning-tricycle, bus and motor), with 8629 pictures in total. The picture scenes include streets, parks, suburbs and night-time blocks, and the pictures contain a large number of small and occluded targets.
The DOTA data set was released by Wuhan University in 2017, with images collected from Google Earth, the JL-1 satellite, and the GF-2 satellite of the China Centre for Resources Satellite Data and Application. The data set includes 15 target categories, such as ground track field, soccer field, plane, ship and swimming pool, with 2806 pictures in total. DOTA pictures are large, and they are captured from a top-down view directly above the objects.
The UAVDT data set is an unmanned aerial vehicle image data set comprising target tracking and target detection data; the target detection subset contains 40735 pictures in total with three target categories: car, bus and truck. The detection data set is obtained by sampling frames at fixed intervals from the target tracking videos.
The experimental results are shown in tables 1, 2 and 3, respectively.
Table 1 shows the experimental results on the VisDrone test set, where the backbone network denotes the network structure used for feature extraction; the division modes "o", "c" and "oc" denote the original image, uniform division and cluster division, respectively; and APs, APm and APl denote the detection accuracy for small, medium and large targets, respectively. To verify the effect of the cluster detection network, different division modes are compared in the experiments: "original image" means the 548 pictures of the test set are used without any processing; "uniform division" means every picture of the test set is uniformly divided into 6 equal parts; and "cluster division" means the test-set pictures are segmented with a clustering method. When ClusDet clusters the pictures, the number of clusters is a fixed hyper-parameter determined before the network is trained, whereas the number of clusters of the proposed self-adaptive clustering network is output adaptively by the network. The detection results show that the proposed detection algorithm greatly outperforms the ClusDet algorithm in the detection of small targets.
TABLE 1
[Table 1 is provided as an image in the original publication and is not reproduced here.]
Table 2 shows the results of different methods on the DOTA data set. The data show that the detection accuracy for medium targets with the uniform division method is actually higher than with the clustering method. The reason is that the images in the DOTA data set are captured by satellite and are very large, the area ratio and number ratio of medium targets are the highest, and the proportion of medium targets in the uniformly divided pictures is far higher than that of small and large targets, so the recall of the detection results for medium targets is also improved. Overall, the average precision of the self-adaptive clustering detection algorithm differs little from that of ClusDet: the detection accuracy for small targets is improved, while that for large and medium targets decreases.
TABLE 2
[Table 2 is provided as an image in the original publication and is not reproduced here.]
Table 3 shows the results of different methods on the UAVDT data set. The data show that the detection accuracy of the self-adaptive clustering algorithm for large, medium and small targets is higher than that of ClusDet, and overall its detection performance is better than that of the other methods.
TABLE 3
[Table 3 is provided as an image in the original publication and is not reproduced here.]
Combining the above results, the self-adaptive clustering detection algorithm proposed by the invention achieves a good detection effect on unmanned aerial vehicle images: the detection of small targets in unmanned aerial vehicle images is greatly improved, while the improvement for large and medium targets is modest. Overall, the detection accuracy of the proposed self-adaptive clustering algorithm is improved to a certain extent.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (5)

1. A self-adaptive clustering target detection method for aerial images of unmanned aerial vehicles is characterized by comprising the following steps:
segmenting the aerial image of the unmanned aerial vehicle based on the self-adaptive clustering network to obtain an image to be detected;
correcting the image to be detected;
constructing a local detection model and a global detection model; introducing an attention mechanism, a variable threshold value and a sample balance strategy into the local detection model; the local detection model includes: a clustering image input sub-network, a channel attention and space attention sub-network and a detection sub-network;
the clustered image input sub-network is used for extracting features from the corrected image to be detected, the channel attention and space attention sub-network performs category fusion to obtain a feature map to be detected, and the detection sub-network is used for detecting the feature map to be detected and outputting a local detection result;
detecting the corrected image to be detected based on the local detection model;
detecting a detection target missed by the local detection model based on the global detection model;
fusing the detection result of the local detection model and the detection result of the global detection model to obtain a self-adaptive clustering target detection result of the aerial image of the unmanned aerial vehicle;
the sample balancing strategy is:
s1, counting the number proportion and the area proportion of various targets in an aerial image data set of the unmanned aerial vehicle;
s2, judging the number ratio and the area ratio to determine the categories to be amplified;
s3, selecting any one corrected image to be detected, skipping the image when the total target number of the amplification categories in the corrected image to be detected is smaller than a second preset value, and classifying the targets needing to be amplified by using a mean shift clustering algorithm when the total target number is larger than or equal to the second preset value to obtain a classification result;
s4, respectively using boundary targets contained in each category in the classification result as boundaries, framing the area, stretching the area in an equal ratio mode to obtain the area size of the original image, generating a new labeling file, judging the ratio of the intercepted target area to the original target area, and labeling based on the judgment result;
s5, repeating the step S3 until all the pictures in the unmanned aerial vehicle aerial image data set are traversed, and obtaining an expanded training set;
the variable threshold is:
TH = f(TH_low, TH_high, U, I)   [the exact expression is given as an image in the original publication]
wherein TH is the variable threshold, TH_low and TH_high are the set minimum and maximum thresholds, and U and I are the union area and the intersection area of the current prediction bounding box and the other prediction bounding boxes, respectively.
2. The method of claim 1, wherein the adaptive clustering network comprises: a feature extraction sub-network, a clustering region suggestion sub-network and a clustering region modification sub-network;
the feature extraction sub-network is used for extracting a feature map of the aerial image of the unmanned aerial vehicle and outputting the feature map in parallel;
the clustering region suggestion subnetwork marks the characteristic graph to obtain the number of clustering regions;
and the clustering region correction sub-network determines the output result of the self-adaptive clustering network according to the feature map and the number of the clustering regions.
3. The method of claim 1, wherein segmenting the aerial image of the drone comprises:
positioning the position of a target in the aerial image of the unmanned aerial vehicle to obtain a positioning area;
carrying out target identification on the positioning area to obtain an aggregation target potential region;
and performing regression prediction on the aggregation target potential region, and segmenting the aggregation target potential region subjected to regression prediction to obtain the image to be detected.
4. The method of claim 1, wherein the modifying the to-be-detected image comprises:
detecting the size of the image to be detected, uniformly dividing the image to be detected with the size exceeding a first preset value, and filling the image to be detected with the length-width ratio being out of balance.
5. The method for adaptive clustering target detection for aerial images of unmanned aerial vehicles according to claim 1, wherein the attention mechanism comprises: a channel attention mechanism and a space attention mechanism;
the operation process of the attention mechanism comprises the following steps: firstly, performing global maximum pooling and global average pooling on feature maps of all channels, then performing compression and activation by using a compression-activation module, and finally outputting by using a sigmoid activation function.
CN202210572768.0A 2022-05-25 2022-05-25 Self-adaptive clustering target detection method for aerial image of unmanned aerial vehicle Active CN114943903B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210572768.0A CN114943903B (en) 2022-05-25 2022-05-25 Self-adaptive clustering target detection method for aerial image of unmanned aerial vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210572768.0A CN114943903B (en) 2022-05-25 2022-05-25 Self-adaptive clustering target detection method for aerial image of unmanned aerial vehicle

Publications (2)

Publication Number Publication Date
CN114943903A CN114943903A (en) 2022-08-26
CN114943903B true CN114943903B (en) 2023-04-07

Family

ID=82908379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210572768.0A Active CN114943903B (en) 2022-05-25 2022-05-25 Self-adaptive clustering target detection method for aerial image of unmanned aerial vehicle

Country Status (1)

Country Link
CN (1) CN114943903B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994116B (en) * 2023-08-04 2024-04-16 北京泰策科技有限公司 Target detection method and system based on self-attention model and yolov5

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860439A (en) * 2020-07-31 2020-10-30 广东电网有限责任公司 Unmanned aerial vehicle inspection image defect detection method, system and equipment
CN112580512A (en) * 2020-12-21 2021-03-30 南京邮电大学 Lightweight unmanned aerial vehicle target detection method based on channel cutting
CN113255589A (en) * 2021-06-25 2021-08-13 北京电信易通信息技术股份有限公司 Target detection method and system based on multi-convolution fusion network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378242A (en) * 2019-06-26 2019-10-25 南京信息工程大学 A kind of remote sensing target detection method of dual attention mechanism
CN112766170B (en) * 2021-01-21 2024-04-16 广西财经学院 Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN113674247B (en) * 2021-08-23 2023-09-01 河北工业大学 X-ray weld defect detection method based on convolutional neural network
CN114037888B (en) * 2021-11-05 2024-03-08 中国人民解放军国防科技大学 Target detection method and system based on joint attention and adaptive NMS
CN114202672A (en) * 2021-12-09 2022-03-18 南京理工大学 Small target detection method based on attention mechanism

Also Published As

Publication number Publication date
CN114943903A (en) 2022-08-26

Similar Documents

Publication Publication Date Title
CN110263706B (en) Method for detecting and identifying dynamic target of vehicle-mounted video in haze weather
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN106971185B (en) License plate positioning method and device based on full convolution network
Azimi et al. Eagle: Large-scale vehicle detection dataset in real-world scenarios using aerial imagery
CN113128348A (en) Laser radar target detection method and system fusing semantic information
CN113762209A (en) Multi-scale parallel feature fusion road sign detection method based on YOLO
CN111709416A (en) License plate positioning method, device and system and storage medium
CN112990065B (en) Vehicle classification detection method based on optimized YOLOv5 model
CN110929649B (en) Network and difficult sample mining method for small target detection
CN111291826A (en) Multi-source remote sensing image pixel-by-pixel classification method based on correlation fusion network
CN116091372B (en) Infrared and visible light image fusion method based on layer separation and heavy parameters
CN112115871B (en) High-low frequency interweaving edge characteristic enhancement method suitable for pedestrian target detection
Rafique et al. Smart traffic monitoring through pyramid pooling vehicle detection and filter-based tracking on aerial images
CN114943903B (en) Self-adaptive clustering target detection method for aerial image of unmanned aerial vehicle
CN116503709A (en) Vehicle detection method based on improved YOLOv5 in haze weather
CN114627269A (en) Virtual reality security protection monitoring platform based on degree of depth learning target detection
Ogunrinde et al. A review of the impacts of defogging on deep learning-based object detectors in self-driving cars
CN114399734A (en) Forest fire early warning method based on visual information
Wang Remote sensing image semantic segmentation algorithm based on improved ENet network
CN113627481A (en) Multi-model combined unmanned aerial vehicle garbage classification method for smart gardens
WO2022156652A1 (en) Vehicle motion state evaluation method and apparatus, device, and medium
CN111160282A (en) Traffic light detection method based on binary Yolov3 network
CN115331127A (en) Unmanned aerial vehicle moving target detection method based on attention mechanism
CN111931572B (en) Target detection method for remote sensing image
CN113869239A (en) Traffic signal lamp countdown identification system and construction method and application method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant