CN115131687A

CN115131687A - Anchor-free aerial image detection method based on semantic information and attention mechanism

Info

Publication number: CN115131687A
Application number: CN202210740512.6A
Authority: CN
Inventors: 刘宁钟; 周王成; 吴磊; 王淑君
Original assignee: Jiangsu Lemote Technology Corp ltd; Nanjing University of Aeronautics and Astronautics
Current assignee: Jiangsu Lemote Technology Corp ltd; Nanjing University of Aeronautics and Astronautics
Priority date: 2022-06-27
Filing date: 2022-06-27
Publication date: 2022-09-30

Abstract

The invention discloses an anchor-free aerial image vehicle detection method based on semantic information and an attention mechanism, comprising the following steps: first, acquiring aerial images of vehicles and labeling the vehicles in them; second, adding semantic information fusion, an attention mechanism and a dynamic activation function to the FCOS network to construct a new network structure; third, feeding the aerial image data set into the neural network for training until the network converges; and finally, detecting vehicles in test images with the trained neural network and weight file, and outputting the detection results. The invention addresses the low vehicle recognition accuracy in current aerial images and improves the accuracy of vehicle detection in aerial images; its detection accuracy (mAP) on the public DLR-3K data set reaches 89.6%.

Description

Anchor-free aerial image detection method based on semantic information and attention mechanism

Technical Field

The invention relates to an anchor-free aerial image detection method based on semantic information and an attention mechanism, and belongs to the technical field of computer vision.

Background

At present, as unmanned aerial vehicle (UAV) technology matures, high-resolution aerial images are becoming easier and easier to obtain. Vehicle detection in aerial images has received extensive attention in the remote sensing field because it matters for intelligent transportation, parking management, city planning, traffic monitoring, UAV piloting, military applications and so on. In an intelligent traffic system, ground vehicles can be detected and road conditions analyzed so that driving routes are optimized, traffic congestion is reduced and travel is made easier. In the military field, a UAV can detect camouflaged ground targets and find suspicious targets so that precise strikes can be carried out. Vehicle detection in high-resolution aerial images nevertheless remains a challenging task because of complex background environments, occlusion by trees and building shadows, small and dense targets, highly unbalanced target distribution density, and large numbers of similar structures.

In recent years, deep learning has developed rapidly, especially target detection algorithms such as the RCNN series, SSD, the YOLO series and RetinaNet. Although CNN-based algorithms have had great success, these algorithms all require the size, aspect ratio and number of anchors to be set manually, and the results are strongly affected by the anchors. Because the anchors are fixed in size, they limit the generalization capability of the detector, especially for small targets. These algorithms generate a large number of anchor boxes on the image to improve the recall rate, but most of the anchor boxes are negative samples, so the positive and negative samples are unbalanced and the computation and the model size increase. In addition, in aerial images the background usually occupies most of the image and the foreground only a small proportion, so the feature maps extracted by the backbone network contain much noise. Existing algorithms struggle to focus on targets and instead attend to the complex background, and their detection accuracy is therefore low.

In terms of feature extraction, VCSOP connects the deep and shallow features of a residual network through a feature pyramid fusion strategy; on the connected features, four convolutional layers in parallel predict vehicle characteristics. Sraf-net obtains contextual information through contextual attention, which lets the network focus on objects whose appearance is not distinctive, and uses deformable convolution to enhance the feature representation. Although these methods use different feature enhancement techniques, they are still not sufficient for detection in aerial images.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing an anchor-free aerial image vehicle detection method based on semantic information and an attention mechanism, solving the problems of low vehicle recognition accuracy and poor model robustness in current aerial images.

The technical scheme adopted by the invention to solve these problems is as follows: an anchor-free aerial image vehicle detection method based on semantic information and an attention mechanism, comprising the following steps:

step 1: a data set acquisition process;

acquiring aerial images of vehicles, and labeling the category and position of each vehicle in the aerial images;

step 2: a neural network construction process;

using ResNet50 as the base network, fusing deep semantic information into the shallow layers, applying an attention mechanism to the features of each layer, and replacing the activation function with a dynamic activation function;

step 3: a neural network training process;

feeding the labeled aerial vehicle image data set, or a public aerial vehicle data set, into the neural network constructed in step 2 for training until the network converges;

step 4: a test image detection process;

detecting vehicle targets in the test image using the trained neural network and weight file.

Further, step 1 of the present invention comprises the following steps:

step 1-1: the data set used is the DLR-3K data set, which was captured with a DLR-3K camera from a height of one kilometer above Munich, Germany, and contains 20 aerial images, each 5616 × 3744 in size; to save computing resources, each image is divided evenly into 11 × 10 tiles with some overlap between adjacent tiles, yielding a number of images of size 702 × 624, and each image is rotated by 90, 180 and 270 degrees to expand the data set;

step 1-2: labeling the obtained data with the labelImg annotation tool, labeling the vehicle category in the images as car;

step 1-3: dividing the data set into a training set, a validation set and a test set at a ratio of 6:2:2.
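The tiling and split in steps 1-1 to 1-3 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the choice of 11 columns by 10 rows, the evenly spaced tile origins and the shuffling seed are all assumptions; only the image size, tile size, grid shape and 6:2:2 ratio come from the text.

```python
import random


def tile_origins(length, tile, count):
    """Evenly spaced start offsets so that `count` tiles of size `tile` span `length`."""
    if count == 1:
        return [0]
    step = (length - tile) / (count - 1)  # a step smaller than `tile` means tiles overlap
    return [round(i * step) for i in range(count)]


def tile_boxes(width=5616, height=3744, tile_w=702, tile_h=624, cols=11, rows=10):
    """Return (x1, y1, x2, y2) crop boxes, row by row (cols/rows assignment is assumed)."""
    xs = tile_origins(width, tile_w, cols)
    ys = tile_origins(height, tile_h, rows)
    return [(x, y, x + tile_w, y + tile_h) for y in ys for x in xs]


def split_622(items, seed=0):
    """Shuffle and split items into train/validation/test at a 6:2:2 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 6 // 10
    n_val = len(items) * 2 // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


boxes = tile_boxes()
train, val, test = split_622(range(len(boxes)))
```

Under these assumptions each source image yields 110 crop boxes, and the last box ends exactly at the image border.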

Further, step 2 of the present invention comprises the following steps:

step 2-1: using ResNet50 as the backbone network to extract features, taking the outputs of the C3, C4 and C5 layers of the network, and compressing the number of channels of each output to 256 to facilitate the subsequent operations;

step 2-2: first, the C5 feature map is passed through the dynamic activation function and the attention mechanism to obtain the P5 feature map, and the P5 feature map is upsampled and added to the C4 feature map to obtain a new feature map P4; after the same operation is applied to P4, the result is added to the C3 feature map and passed through the dynamic activation function and the attention mechanism to obtain the P3 feature map. The deep C5 feature map has a larger receptive field and high-level semantic information, while the shallow C3 feature map has better position information; because vehicles occupy a small area in aerial images and shallow feature maps are more favorable for detecting small targets, the deep semantic information is fused into the shallow layer and detection is performed on the shallow feature map;

step 2-3: the P3 feature map is fed into the detection head, which has two branches, one for classification and one for regression; the detection head also has a center-ness branch, parallel to the regression branch, which suppresses low-quality predicted boxes far from the center point; the regression branch predicts four values l, r, t and b for each feature point, representing the distances from that pixel to the left, right, top and bottom sides, respectively;

step 2-4: using Focal loss as the classification loss function, IoU loss as the regression loss, and binary cross-entropy loss as the center-ness loss function.
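The regression targets of step 2-3 can be illustrated with a short sketch. The l, r, t, b distances follow the text directly; the center-ness score below uses the definition from the original FCOS detector, which is an assumption here because the text does not restate it. Function names are hypothetical.

```python
import math


def ltrb_targets(px, py, box):
    """Distances from pixel (px, py) to the left, right, top and bottom sides of a box."""
    x1, y1, x2, y2 = box
    return px - x1, x2 - px, py - y1, y2 - py


def centerness(l, r, t, b):
    """FCOS-style center-ness (assumed): 1.0 at the box center, lower toward the edges."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


box = (0.0, 0.0, 10.0, 10.0)
center_score = centerness(*ltrb_targets(5.0, 5.0, box))  # pixel at the box center
edge_score = centerness(*ltrb_targets(1.0, 5.0, box))    # pixel near the left side
```

A pixel near a box edge receives a low score, which is how a center-ness branch can down-weight low-quality boxes predicted far from the center point.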

Further, step 3 of the present invention comprises the following steps:

step 3-1: using ResNet50 as the backbone network and training for 36 epochs with the SGD optimizer; the initial learning rate is set to 0.01 and is reduced by a factor of ten at epochs 30 and 33, respectively; the batch size is set to 8, and each image is resized to 1000 × 600 for training;

step 3-2: trying different training hyper-parameters on the neural network, and training to obtain a network file and a weight file that can be used for aerial image vehicle detection.
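The learning-rate schedule of step 3-1 can be written as a small step function. This is a sketch under the assumptions that epochs are counted from 0 and that the rate is multiplied by 0.1 at each of the two milestones; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(30, 33), factor=0.1):
    """Step schedule: multiply the rate by `factor` for every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# One learning rate per epoch for the 36-epoch run described above.
schedule = [learning_rate(e) for e in range(36)]
```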

Further, the dynamic activation function used in step 2-2 of the present invention computes four parameters, denoted a1, a2, b1 and b2, from global semantic information, and then computes max(a1 × X + b1, a2 × X + b2), where X is the input feature map. Compared with earlier activation functions, this activation function adapts dynamically to the semantic information, is more robust, and benefits detection.
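A minimal sketch of the element-wise formula max(a1 × X + b1, a2 × X + b2) is given below. How the four parameters are derived from global semantic information is not specified in this passage, so the sketch simply takes them as inputs; the function name and the example values are assumptions.

```python
def dynamic_activation(x, a1, a2, b1, b2):
    """Apply max(a1*x + b1, a2*x + b2) element-wise to a flat list of values."""
    return [max(a1 * v + b1, a2 * v + b2) for v in x]


values = [-2.0, 0.5, 3.0]
# With a1=1 and a2=b1=b2=0 the formula reduces to ReLU.
relu_like = dynamic_activation(values, a1=1.0, a2=0.0, b1=0.0, b2=0.0)
# With a small positive a2 it behaves like a leaky ReLU.
leaky_like = dynamic_activation(values, a1=1.0, a2=0.1, b1=0.0, b2=0.0)
```

Fixed choices of the parameters recover familiar static activations; letting the parameters depend on the input is what makes the activation dynamic.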

Further, the attention mechanism used in step 2-2 of the present invention comprises a channel attention mechanism and a spatial attention mechanism. The channel attention mechanism lets the network select the more important channels and reduce the weights of unimportant ones, so that the network attends to the more important channels; the spatial attention mechanism uses a transformer to compute global interactions, so that the network focuses on important positions.
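The idea of re-weighting channels can be illustrated with a deliberately simplified sketch. The text does not describe the internals of the channel attention module, so everything below is an assumption: each channel is summarized by global average pooling and gated with a sigmoid, with no learned layers, which a real module would normally include.

```python
import math


def channel_attention(feature_map):
    """feature_map: list of channels, each a 2D list (H x W) of floats.

    Returns the re-weighted feature map and the per-channel weights.
    """
    weights = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)               # global average pooling
        weights.append(1.0 / (1.0 + math.exp(-mean)))  # sigmoid gate in (0, 1)
    scaled = [[[w * v for v in row] for row in channel]
              for w, channel in zip(weights, feature_map)]
    return scaled, weights


fmap = [
    [[0.0, 0.0], [0.0, 0.0]],  # weakly activated channel
    [[2.0, 2.0], [2.0, 2.0]],  # strongly activated channel
]
scaled, weights = channel_attention(fmap)
```

In this toy version the strongly activated channel receives the larger weight, mirroring the stated goal of emphasizing important channels and attenuating unimportant ones.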

Further, the loss function of the present invention is as follows:

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

$$L_{iou} = -\ln(iou)$$

where $FL(p_t)$ is the classification loss, $p_t$ is the predicted probability of the class, and $\alpha_t$ and $\gamma$ are two hyperparameters; $L_{iou}$ is the regression loss, and $iou$ is the intersection over union of the predicted box and the ground truth; $L$ is the cross-entropy loss function, $y_i$ denotes the label of sample $i$ (1 for the positive class, 0 for the negative class), and $p_i$ denotes the probability that sample $i$ is predicted as positive.
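The three losses can be sketched in plain Python as follows. The focal loss and IoU loss follow the formulas above; the cross-entropy formula itself does not appear in the text, so the standard binary form is assumed. The default values alpha = 0.25 and gamma = 2.0, the clamping constant eps and all function names are also assumptions.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for one binary prediction."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred, target, eps=1e-9):
    """L_iou = -ln(iou), clamped so that non-overlapping boxes give a finite loss."""
    return -math.log(max(iou(pred, target), eps))


def binary_cross_entropy(p, y, eps=1e-9):
    """Standard binary cross-entropy for one sample (assumed form of L)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
```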

Further, step 4 of the present invention comprises the following steps:

step 4-1: feeding the test image into the anchor-free network to obtain feature maps from different layers of the network;

step 4-2: performing feature enhancement and fusion on the feature maps to obtain the final feature map;

step 4-3: feeding the feature map into the detection head, which outputs predicted bounding-box values and classification values;

step 4-4: setting a threshold, and filtering by non-maximum suppression to obtain the final detection results.
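The thresholding and non-maximum suppression of step 4-4 can be sketched as below. This is the generic greedy NMS procedure, not necessarily the exact variant used by the invention; the score threshold of 0.05 and IoU threshold of 0.5 are assumed values, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thr=0.05, iou_thr=0.5):
    """Drop low-scoring boxes, then greedily keep the best box of each overlapping group."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with an already kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.01]
kept = nms(boxes, scores)
```

In this example the second box is suppressed because it overlaps the higher-scoring first box, and the fourth box is removed by the score threshold.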

Advantageous effects:

1. The invention is an anchor-free aerial image vehicle detection method based on semantic information and an attention mechanism; by using anchor-free detection on top of a ResNet50 backbone network together with feature fusion, deeper semantic information can be extracted and the ability to recognize small targets is enhanced.

2. By replacing the activation function of the network with a dynamic activation function, the invention extracts features effectively and improves the accuracy of vehicle detection. In addition, the invention uses an attention mechanism, which strengthens the interaction capability of the network and improves its robustness.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a flowchart of step 2 according to an embodiment of the present invention.

FIG. 3 is a flowchart of step 3 according to an embodiment of the present invention.

FIG. 4 is a flowchart of step 4 according to an embodiment of the present invention.

FIG. 5 is a diagram illustrating the detection results in the embodiment of the present invention.

Detailed Description

The invention will be described in more detail with reference to the accompanying drawings.

As shown in FIG. 1, the invention provides an anchor-free aerial image vehicle detection method based on semantic information and an attention mechanism, which comprises the following steps:

step 1: a data set acquisition process, namely acquiring aerial images of vehicles and labeling the category and position of each vehicle in the aerial images;

step 2: a neural network construction process, namely using ResNet50 as the backbone network, using an anchor-free detection scheme, fusing deep semantic information into the shallow layers, applying an attention mechanism to the features of each layer, and replacing the activation function with a dynamic activation function;

step 3: a neural network training process, namely feeding the labeled aerial vehicle image data set, or a public aerial vehicle data set, into the neural network constructed in step 2 for training until the network converges;

step 4: a test image detection process, namely detecting vehicle targets in the test image using the trained neural network and weight file.

In this embodiment, the specific technical solution of the present invention includes:

step 1) capturing aerial images above a city with an unmanned aerial vehicle, collecting pictures containing vehicles, and then labeling the categories in the pictures with annotation software;

step 2) first adopting an anchor-free detection scheme, then extracting features more effectively through feature fusion and the attention mechanism, and finally replacing the activation function with a dynamic activation function.

As shown in FIG. 2, step 2 of the present invention comprises the following steps:

step 201: applying the attention mechanism and the dynamic activation function to the outputs of the C3, C4 and C5 branches of ResNet50, respectively;

step 202: fusing the semantic information in the deep feature maps into the shallow feature map;

step 203: performing detection on the fused feature map in an anchor-free manner.
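The top-down fusion of steps 201 to 203 can be illustrated on tiny single-channel feature maps. This sketch makes several assumptions not stated in the text: adjacent levels differ by a factor of two in resolution, upsampling is nearest-neighbour, and `enhance` stands in for the combined dynamic activation and attention (an identity function by default). The order in which `enhance` is applied around P4 is one possible reading of the description, and all names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


def add(a, b):
    """Element-wise sum of two 2D lists of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse_top_down(c3, c4, c5, enhance=lambda f: f):
    """Fuse deep features into shallow ones: C5 -> P5 -> P4 -> P3."""
    p5 = enhance(c5)
    p4 = add(upsample2x(p5), c4)
    p3 = enhance(add(upsample2x(enhance(p4)), c3))
    return p3, p4, p5


c5 = [[1.0]]
c4 = [[1.0, 2.0], [3.0, 4.0]]
c3 = [[0.0] * 4 for _ in range(4)]
p3, p4, p5 = fuse_top_down(c3, c4, c5)
```

The resulting P3 map has the resolution of C3 but carries values propagated down from C5 and C4, which is the sense in which deep semantic information is fused into the shallow layer.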

As shown in FIG. 3, step 3 of the present invention comprises the following steps:

step 301: preprocessing the training images with data augmentation operations such as flipping, cropping, enlarging and shrinking;

step 302: using ImageNet pre-trained weights as the initial weights, and setting the learning rate, number of iterations, batch_size and so on;

step 303: training on the input images, and stopping when the loss function converges or the maximum number of iterations is reached, to obtain a weight file that can be used for aerial image vehicle detection.
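For the flipping augmentation mentioned in step 301, the bounding-box labels have to be transformed together with the image. The sketch below shows a horizontal flip on a toy image stored as a list of rows; the function names are hypothetical, and the width of 702 used in the example is simply the tile width quoted earlier in the description.

```python
def hflip_image(image):
    """Mirror an image (list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def hflip_box(box, image_width):
    """Mirror an (x1, y1, x2, y2) box so it still covers the same object after the flip."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)


flipped_image = hflip_image([[1, 2, 3], [4, 5, 6]])
flipped_box = hflip_box((10, 20, 110, 60), image_width=702)
```

Applying the flip twice returns the original box, which is a quick sanity check for this kind of label transform.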

As shown in FIG. 4, step 4 of the present invention comprises the following steps:

step 401: feeding the test image into the ResNet backbone network to obtain feature layers at three scales;

step 402: applying the attention mechanism and the dynamic activation function to each layer, and then fusing the scales;

step 403: processing the convolutional feature map with the anchor-free algorithm, and outputting predicted bounding boxes and classification values;

step 404: applying non-maximum suppression so that only the best detection box is kept and the remaining detection boxes are filtered out; if none of the detection boxes is good enough, all of them are filtered out, yielding the final detection results.

FIG. 5 shows aerial vehicle images and the detection results obtained with the method of the present invention. As can be seen from FIG. 5, all four images have complex backgrounds and dense targets, and some targets are occluded by trees and houses. In testing, the mAP of the method reaches 89.6% on the DLR-3K data set.

The above embodiments merely illustrate preferred embodiments of the present invention and should not limit its scope; any modification made on the basis of the technical solution in accordance with the technical idea presented by the present invention falls within the scope of the present invention.

Claims

1. An anchor-free aerial image vehicle detection method based on semantic information and an attention mechanism, characterized by comprising the following steps:

step 1: a data set acquisition process;

step 2: a neural network construction process;

step 3: a neural network training process;

step 4: a test image detection process;

2. The anchor-free aerial image vehicle detection method based on semantic information and an attention mechanism as claimed in claim 1, wherein step 1 comprises the following steps:

step 1-1: the data set used is the DLR-3K data set, which was captured with a DLR-3K camera from a height of one kilometer above Munich, Germany, and contains 20 aerial images, each 5616 × 3744 in size; to save computing resources, each image is divided evenly into 11 × 10 tiles with some overlap between adjacent tiles, yielding a number of images of size 702 × 624, and the data set is expanded by rotating each image by 90, 180 and 270 degrees;

step 1-2: labeling the obtained data with the labelImg annotation tool, labeling the vehicle category in the images as car;

3. The anchor-free aerial image vehicle detection method based on semantic information and an attention mechanism as claimed in claim 1, wherein step 2 comprises the following steps:

step 2-1: using ResNet50 as the backbone network to extract features, and taking the outputs of the C3, C4 and C5 layers of the network;

step 2-2: first, the C5 feature map is passed through the dynamic activation function and the attention mechanism to obtain the P5 feature map, and the P5 feature map is upsampled and added to the C4 feature map to obtain a new feature map P4; after the same operation is applied to P4, the result is added to the C3 feature map and passed through the dynamic activation function and the attention mechanism to obtain the P3 feature map; the deep C5 feature map has a larger receptive field and high-level semantic information, and the shallow C3 feature map has better position information;

step 2-3: feeding the P3 feature map into the detection head, which has two branches, one for classification and one for regression;

step 2-4: using Focal loss as the classification loss function, IoU loss as the regression loss, and binary cross-entropy loss as the center-ness loss function.

4. The anchor-free aerial image vehicle detection method based on semantic information and an attention mechanism as claimed in claim 1, wherein step 3 comprises the following steps:

step 3-1: using ResNet50 as the backbone network and training for 36 epochs with the SGD optimizer;

step 3-2: trying different training hyper-parameters on the neural network, and training to obtain a network file and a weight file that can be used for aerial image vehicle detection.

5. The anchor-free aerial image vehicle detection method based on semantic information and an attention mechanism as claimed in claim 3, wherein the dynamic activation function used in step 2-2 computes four parameters a1, a2, b1 and b2 from global semantic information, and then computes max(a1 × X + b1, a2 × X + b2), where X is the input feature map; compared with earlier activation functions, this activation function adapts dynamically to the semantic information, is more robust, and benefits detection.

6. The anchor-free aerial image vehicle detection method based on semantic information and an attention mechanism as claimed in claim 3, wherein the attention mechanism used in step 2-2 comprises a channel attention mechanism and a spatial attention mechanism.

7. The anchor-free aerial image vehicle detection method based on semantic information and an attention mechanism, characterized in that the loss function is as follows:

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

$$L_{iou} = -\ln(iou)$$

where $FL(p_t)$ is the classification loss, $p_t$ is the predicted probability of the class, and $\alpha_t$ and $\gamma$ are two hyperparameters; $L_{iou}$ is the regression loss, and $iou$ is the intersection over union of the predicted box and the ground truth; $L$ is the cross-entropy loss function, $y_i$ denotes the label of sample $i$ (1 for the positive class, 0 for the negative class), and $p_i$ denotes the probability that sample $i$ is predicted as positive.

8. The anchor-free aerial image vehicle detection method based on semantic information and an attention mechanism as claimed in claim 1, wherein step 4 comprises the following steps:

step 4-2: performing feature enhancement and fusion on the feature map to obtain a final feature map;

step 4-3: inputting the characteristic diagram into a detection head, and outputting a prediction boundary value and a classification value;