CN117710841A - Small target detection method and device for aerial image of unmanned aerial vehicle - Google Patents

Small target detection method and device for aerial image of unmanned aerial vehicle

Info

Publication number
CN117710841A
Authority
CN
China
Prior art keywords
feature
unmanned aerial
aerial vehicle
map
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311725252.6A
Other languages
Chinese (zh)
Inventor
孙伟
沈欣怡
张小瑞
管菲
刘轩诚
赵宇煌
叶健峰
郭邦祺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202311725252.6A priority Critical patent/CN117710841A/en
Publication of CN117710841A publication Critical patent/CN117710841A/en
Pending legal-status Critical Current

Links

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T — CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 — Road transport of goods or passengers
    • Y02T 10/10 — Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 — Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a small target detection method and device for aerial images of an unmanned aerial vehicle, comprising the following steps: performing feature extraction on the real-time acquired aerial image of the unmanned aerial vehicle by using a feature extraction backbone network to obtain an initial feature map of the aerial image of the unmanned aerial vehicle; performing feature enhancement and feature refinement on the initial feature map by using the improved feature pyramid network to obtain a final feature map; detecting the final feature map by using the trained small target detection model to obtain a small target detection result of the unmanned aerial vehicle aerial image; the improved feature pyramid network comprises a context feature enhancement module and a feature pyramid refinement module. The invention can improve the feature expression capability of the feature extraction network on the small target and improve the detection precision of the small target detection.

Description

Small target detection method and device for aerial image of unmanned aerial vehicle
Technical Field
The invention relates to a method and a device for detecting a small target of an aerial image of an unmanned aerial vehicle, and belongs to the technical field of unmanned aerial vehicle target detection.
Background
The maturation of unmanned aerial vehicle technology has accelerated the application of unmanned aerial vehicles in transportation systems. Compared with traditional fixed-position monitoring cameras, an unmanned aerial vehicle monitoring system has the advantages of low cost, convenient deployment, high maneuverability and a wider field of view. In addition, by adjusting its flying height and position, an unmanned aerial vehicle can effectively avoid the occlusion problem and accurately and quickly monitor objects on different roads without interfering with road traffic.
Current unmanned aerial vehicle target detection technology falls mainly into two categories: traditional computer vision methods and deep learning methods. Traditional computer vision methods mainly depend on feature extraction and target recognition algorithms; common feature extraction methods include the gray-level co-occurrence matrix, the Histogram of Oriented Gradients (HOG) and the Local Binary Pattern (LBP), after which target recognition is performed by a classifier such as a Support Vector Machine (SVM) or AdaBoost. Traditional computer vision methods perform well in some simple scenes, but their performance degrades readily under complex backgrounds, target scale changes and similar conditions. Deep learning methods have achieved remarkable breakthroughs in unmanned aerial vehicle target detection: image features and target representations can be learned automatically by deep neural networks, particularly Convolutional Neural Networks (CNN). Popular deep learning models such as Fast R-CNN, YOLO and SSD are widely used for unmanned aerial vehicle target detection tasks; these models can achieve real-time target detection and offer better robustness and accuracy in complex scenes. Integrating a target detection algorithm into the high-precision camera of an unmanned aerial vehicle allows road traffic data to be collected and processed from high altitude more flexibly and accurately, improving target search efficiency.
However, some targets in unmanned aerial vehicle aerial images are small in size and low in resolution, so the available feature information is sparse and such targets are easily missed or falsely detected; moreover, the fine-grained information of small targets in the feature map is gradually weakened during convolutional neural network downsampling, the feature expression capability gradually degrades, and the missed-detection and false-detection problems of the network are further aggravated. Therefore, in the current deep learning and computer vision fields, the small target detection performance of unmanned aerial vehicle target detection technology is poor and needs to be improved.
Disclosure of Invention
Aiming at the problems that ground targets captured by unmanned aerial vehicles from high altitude occupy few pixels in the image and are small in size, the invention provides a small target detection method and device for unmanned aerial vehicle aerial images based on an end-to-end lightweight network.
In order to solve the technical problems, the invention is realized by adopting the following technical scheme.
In a first aspect, the invention provides a small target detection method for aerial images of an unmanned aerial vehicle, comprising the following steps:
performing feature extraction on the real-time acquired aerial image of the unmanned aerial vehicle by using a feature extraction backbone network to obtain an initial feature map of the aerial image of the unmanned aerial vehicle;
performing feature enhancement and feature refinement on the initial feature map by using an improved feature pyramid network to obtain a final feature map;
detecting the final feature map by using a trained small target detection model to obtain a small target detection result of the unmanned aerial vehicle aerial image;
the improved feature pyramid network comprises a context feature enhancement module and a feature pyramid refinement module.
With reference to the first aspect, further, the feature extraction is performed on the real-time acquired aerial image of the unmanned aerial vehicle by using a feature extraction backbone network to obtain an initial feature map of the aerial image of the unmanned aerial vehicle, including:
performing depth convolution and point convolution on the unmanned aerial vehicle aerial image by using the depth separable convolution layer to obtain feature map one;
and performing attention weighting processing on feature map one by using the reverse residual structure and the component separation attention module to obtain an attention-weighted feature map two, namely the initial feature map of the unmanned aerial vehicle aerial image.
With reference to the first aspect, further, the performing feature enhancement and feature refinement on the initial feature map by using the improved feature pyramid network to obtain a final feature map includes:
extracting context information around the initial feature map by using the context feature enhancement module to obtain feature map three containing the context information;
and performing feature refinement on feature map three in the channel dimension and the spatial dimension by using the feature pyramid refinement module to obtain the final feature map of the unmanned aerial vehicle aerial image.
With reference to the first aspect, further, performing feature refinement on the feature map three in a channel dimension and a space dimension by using the feature pyramid refinement module includes:
under the channel dimension, performing adaptive average pooling and adaptive maximum pooling on feature map three by using a channel purification module to obtain the spatial context features $F_{avg}^{c}$ and $F_{max}^{c}$, from which the channel attention map is generated; the expression of the channel purification module is as follows:

$$M_{c}(F)=\sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F))+\mathrm{MLP}(\mathrm{MaxPool}(F))\big)=\sigma\big(W_{1}(W_{0}(F_{avg}^{c}))+W_{1}(W_{0}(F_{max}^{c}))\big)$$

wherein $M_{c}(F)$ is the channel attention map output by the channel purification module, $\mathrm{AvgPool}(F)$ denotes adaptive average pooling, $\mathrm{MaxPool}(F)$ denotes adaptive maximum pooling, $F_{avg}^{c}$ is the adaptively average-pooled spatial context feature, $F_{max}^{c}$ is the adaptively max-pooled spatial context feature, $W_{0}$ is the parameter matrix applied to the adaptively average-pooled and adaptively max-pooled spatial context features, and $W_{1}$ is the parameter matrix of the hidden layer of the multi-layer perceptron;

in the spatial dimension, generating through softmax the relative weights of each position with respect to the channels in the channel attention map, using the spatial purification module to process the spatial context features $F_{avg}^{c}$ and $F_{max}^{c}$ based on these relative weights to obtain the corresponding spatial features $F_{avg}^{s}$ and $F_{max}^{s}$, and applying a standard convolution to $F_{avg}^{s}$ and $F_{max}^{s}$ to obtain the spatial attention map; the expression of the spatial purification module is as follows:

$$M_{s}(F)=\sigma\big(f^{7\times 7}([F_{avg}^{s};F_{max}^{s}])\big)$$

wherein $M_{s}(F)$ is the spatial attention map output by the spatial purification module, $f^{7\times 7}$ denotes a convolution with a kernel size of $7\times 7$, $F_{avg}^{s}$ is the spatial feature corresponding to $F_{avg}^{c}$, and $F_{max}^{s}$ is the spatial feature corresponding to $F_{max}^{c}$;

and fusing the channel attention map of the channel dimension with the spatial attention map of the spatial dimension to obtain the final feature map of the unmanned aerial vehicle aerial image.
With reference to the first aspect, further, in the training process of the small target detection model, the overall loss of the small target detection model is calculated from the target classification loss, the frame regression loss and the centrality loss;

the expression of the joint loss function of target classification and frame regression of the small target detection model is as follows:

$$L(\{p_{x,y}\},\{t_{x,y}\})=\frac{1}{N_{pos}}\sum_{x,y}L_{cls}(p_{x,y},c_{x,y}^{*})+\frac{\lambda_{1}}{N_{pos}}\sum_{x,y}\mathbb{1}_{\{c_{x,y}^{*}>0\}}L_{reg}(t_{x,y},t_{x,y}^{*})$$

wherein $L(\{p_{x,y}\},\{t_{x,y}\})$ denotes the joint loss function of target classification and frame regression; $\{p_{x,y}\}$ denotes the set of prediction classification results of the small target detection model, each pixel position $(x,y)$ in the final feature map of the unmanned aerial vehicle aerial image corresponding to a multidimensional vector $p_{x,y}$, which denotes the prediction probabilities that pixel $(x,y)$ belongs to the different categories; $\{t_{x,y}\}$ denotes the set of prediction frame regression results of the small target detection model, each pixel position $(x,y)$ corresponding to a four-dimensional vector $t_{x,y}$, which denotes the frame regression prediction value of pixel $(x,y)$; $c_{x,y}^{*}$ denotes the real target class label of pixel $(x,y)$; $t_{x,y}^{*}$ denotes the frame regression target of pixel $(x,y)$; $L_{cls}$ is the focal loss; $L_{reg}$ is the generalized intersection-over-union loss; $N_{pos}$ denotes the number of positive samples; $\lambda_{1}$ is the balance weight of $L_{reg}$; and $\mathbb{1}_{\{c_{x,y}^{*}>0\}}$ is the indicator function, equal to 1 when $c_{x,y}^{*}>0$ and 0 otherwise;

in the detection process, the large number of frames generated far from the target center point obviously degrades the target detection effect, so to reduce the number of low-quality frames a single-layer branch parallel to the frame regression branch is provided to predict the centrality of frames;

the centrality loss describes the normalized distance between a pixel in the final feature map and the target center it is responsible for;

combining the target classification loss, the frame regression loss and the centrality loss yields the overall loss function of the small target detection model:

$$L(\{p_{x,y}\},\{t_{x,y}\},\{c_{x,y}\})=L(\{p_{x,y}\},\{t_{x,y}\})+\frac{\lambda_{2}}{N_{pos}}\sum_{x,y}\mathbb{1}_{\{c_{x,y}^{*}>0\}}L_{cen}(\mathrm{center}_{x,y},\mathrm{center}_{x,y}^{*})$$

wherein $L(\{p_{x,y}\},\{t_{x,y}\},\{c_{x,y}\})$ denotes the overall loss function, $\mathrm{center}_{x,y}$ and $\mathrm{center}_{x,y}^{*}$ denote the predicted centrality value and the true centrality value at pixel $(x,y)$ respectively, and $\lambda_{2}$ is the balance weight of the centrality loss function $L_{cen}$.
In a second aspect, the present invention provides a small target detection device for aerial images of an unmanned aerial vehicle, including:
the feature extraction module is used for performing feature extraction on the real-time acquired unmanned aerial vehicle aerial image by using a feature extraction backbone network to obtain an initial feature map of the unmanned aerial vehicle aerial image;
the feature enhancement refinement module is used for carrying out feature enhancement and feature refinement on the initial feature map by utilizing an improved feature pyramid network to obtain a final feature map;
the small target detection module is used for detecting the final feature map by using a trained small target detection model to obtain a small target detection result of the unmanned aerial vehicle aerial image;
in the feature extraction module, the feature extraction backbone network comprises a depth separable convolution layer, a reverse residual structure and a component separation attention module which are connected in sequence;
in the feature enhancement refinement module, the improved feature pyramid network includes a contextual feature enhancement module and a feature pyramid refinement module.
With reference to the second aspect, further, the specific operation of the feature enhancement refinement module is:
extracting context information around the initial feature map by using the context feature enhancement module to obtain feature map three containing the context information;
and performing feature refinement on feature map three in the channel dimension and the spatial dimension by using the feature pyramid refinement module to obtain the final feature map of the unmanned aerial vehicle aerial image.
With reference to the second aspect, further, performing feature refinement on the feature map three in a channel dimension and a space dimension by using the feature pyramid refinement module includes:
under the channel dimension, performing adaptive average pooling and adaptive maximum pooling on feature map three by using a channel purification module to obtain the spatial context features $F_{avg}^{c}$ and $F_{max}^{c}$, from which the channel attention map is generated; the expression of the channel purification module is as follows:

$$M_{c}(F)=\sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F))+\mathrm{MLP}(\mathrm{MaxPool}(F))\big)=\sigma\big(W_{1}(W_{0}(F_{avg}^{c}))+W_{1}(W_{0}(F_{max}^{c}))\big)$$

wherein $M_{c}(F)$ is the channel attention map output by the channel purification module, $\mathrm{AvgPool}(F)$ denotes adaptive average pooling, $\mathrm{MaxPool}(F)$ denotes adaptive maximum pooling, $F_{avg}^{c}$ is the adaptively average-pooled spatial context feature, $F_{max}^{c}$ is the adaptively max-pooled spatial context feature, $W_{0}$ is the parameter matrix applied to the adaptively average-pooled and adaptively max-pooled spatial context features, and $W_{1}$ is the parameter matrix of the hidden layer of the multi-layer perceptron;

in the spatial dimension, generating through softmax the relative weights of each position with respect to the channels in the channel attention map, using the spatial purification module to process the spatial context features $F_{avg}^{c}$ and $F_{max}^{c}$ based on these relative weights to obtain the corresponding spatial features $F_{avg}^{s}$ and $F_{max}^{s}$, and applying a standard convolution to $F_{avg}^{s}$ and $F_{max}^{s}$ to obtain the spatial attention map; the expression of the spatial purification module is as follows:

$$M_{s}(F)=\sigma\big(f^{7\times 7}([F_{avg}^{s};F_{max}^{s}])\big)$$

wherein $M_{s}(F)$ is the spatial attention map output by the spatial purification module, $f^{7\times 7}$ denotes a convolution with a kernel size of $7\times 7$, $F_{avg}^{s}$ is the spatial feature corresponding to $F_{avg}^{c}$, and $F_{max}^{s}$ is the spatial feature corresponding to $F_{max}^{c}$;

and fusing the channel attention map of the channel dimension with the spatial attention map of the spatial dimension to obtain the final feature map of the unmanned aerial vehicle aerial image.
With reference to the second aspect, further, in the small target detection module, the overall loss of the small target detection model is calculated from the target classification loss, the frame regression loss and the centrality loss;

the expression of the joint loss function of target classification and frame regression of the small target detection model is as follows:

$$L(\{p_{x,y}\},\{t_{x,y}\})=\frac{1}{N_{pos}}\sum_{x,y}L_{cls}(p_{x,y},c_{x,y}^{*})+\frac{\lambda_{1}}{N_{pos}}\sum_{x,y}\mathbb{1}_{\{c_{x,y}^{*}>0\}}L_{reg}(t_{x,y},t_{x,y}^{*})$$

wherein $L(\{p_{x,y}\},\{t_{x,y}\})$ denotes the joint loss function of target classification and frame regression; $\{p_{x,y}\}$ denotes the set of prediction classification results of the small target detection model, each pixel position $(x,y)$ in the final feature map of the unmanned aerial vehicle aerial image corresponding to a multidimensional vector $p_{x,y}$, which denotes the prediction probabilities that pixel $(x,y)$ belongs to the different categories; $\{t_{x,y}\}$ denotes the set of prediction frame regression results of the small target detection model, each pixel position $(x,y)$ corresponding to a four-dimensional vector $t_{x,y}$, which denotes the frame regression prediction value of pixel $(x,y)$; $c_{x,y}^{*}$ denotes the real target class label of pixel $(x,y)$; $t_{x,y}^{*}$ denotes the frame regression target of pixel $(x,y)$; $L_{cls}$ is the focal loss; $L_{reg}$ is the generalized intersection-over-union loss; $N_{pos}$ denotes the number of positive samples; $\lambda_{1}$ is the balance weight of $L_{reg}$; and $\mathbb{1}_{\{c_{x,y}^{*}>0\}}$ is the indicator function, equal to 1 when $c_{x,y}^{*}>0$ and 0 otherwise;

in the detection process, the large number of frames generated far from the target center point obviously degrades the target detection effect, so to reduce the number of low-quality frames a single-layer branch parallel to the frame regression branch is provided to predict the centrality of frames;

the centrality loss describes the normalized distance between a pixel in the final feature map and the target center it is responsible for;

combining the target classification loss, the frame regression loss and the centrality loss yields the overall loss function of the small target detection model:

$$L(\{p_{x,y}\},\{t_{x,y}\},\{c_{x,y}\})=L(\{p_{x,y}\},\{t_{x,y}\})+\frac{\lambda_{2}}{N_{pos}}\sum_{x,y}\mathbb{1}_{\{c_{x,y}^{*}>0\}}L_{cen}(\mathrm{center}_{x,y},\mathrm{center}_{x,y}^{*})$$

wherein $L(\{p_{x,y}\},\{t_{x,y}\},\{c_{x,y}\})$ denotes the overall loss function, $\mathrm{center}_{x,y}$ and $\mathrm{center}_{x,y}^{*}$ denote the predicted centrality value and the true centrality value at pixel $(x,y)$ respectively, and $\lambda_{2}$ is the balance weight of the centrality loss function $L_{cen}$.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a small target detection method and device for unmanned aerial vehicle aerial images. In the feature extraction stage, network parameters are reduced through the depth separable convolution and reverse residual structure, and the component separation attention module assists feature extraction, effectively strengthening the feature extraction capability of the backbone network. After feature extraction, the invention introduces a context feature enhancement module and a feature pyramid refinement module, which add context information to the initial features to realize feature enhancement and then perform feature refinement; this fully mines the dependency relationships between objects and between objects and the background, retains feature information well while keeping the parameter count low, effectively improves the network's feature expression capability for small targets, and benefits the subsequent detection of small targets. The invention constructs the small target detection model with a single-stage anchor-free design, which balances positive and negative samples in the data set while greatly reducing complex hyperparameter settings and computation, accurately detects small targets in unmanned aerial vehicle aerial images from the feature map, and improves the detection precision of small target detection.
Drawings
FIG. 1 is a schematic diagram showing steps of a method for detecting a small target of an aerial image of an unmanned aerial vehicle according to the present invention;
FIG. 2 is a schematic diagram of a feature extraction backbone network according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a small target detection model according to an embodiment of the invention.
Detailed Description
The following detailed description of the present invention is made with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments and their features described below are intended to explain the technical solutions of the present invention rather than to limit them, and that the embodiments and the technical features in the embodiments may be combined with each other provided there is no conflict.
Example 1
The embodiment introduces a small target detection method for unmanned aerial vehicle aerial images, which adopts an end-to-end lightweight network structure to perform feature extraction and feature processing on unmanned aerial vehicle aerial images. The end-to-end lightweight network comprises a feature extraction backbone network and an improved feature pyramid network: the feature extraction backbone network mainly comprises a depth separable convolution layer, a reverse residual structure and a component separation (group-separated) attention module, and the improved feature pyramid network fuses a context feature enhancement (Context Feature Augmentation, CFA) module and a feature pyramid refinement (Feature Pyramid Refinement, FPR) module on the basis of the existing PANet feature pyramid structure. As shown in fig. 1, the method specifically comprises the following steps:
And step A, performing feature extraction on the real-time acquired unmanned aerial vehicle aerial image by using a feature extraction backbone network to obtain an initial feature map of the unmanned aerial vehicle aerial image.
In order to reduce the loss of feature information in convolution, a reverse residual structure is constructed on the basis of the depth separable convolution and combined with the component separation attention module to realize the initial feature extraction operation, yielding the feature extraction backbone network.
Step A01, after the unmanned aerial vehicle aerial image is input into the feature extraction backbone network, the input image is first expanded from low dimension to high dimension through a 1×1 convolution layer to obtain a high-dimensional image of size $D_F \times D_F \times M_t$, wherein $D_F$ denotes the width and height of the image, $M_t$ denotes the number of channels of the image, and the channel expansion factor $t$ is used to control the degree of expansion.
Step A02, using the depth separable convolution layer to learn the small target features in the high-dimensional image of size $D_F \times D_F \times M_t$. The depth separable convolution decomposes a standard convolution into a depth convolution and a point convolution: each input channel of the input image is filtered by the depth convolution to obtain filtered output features, and the filtered output features are linearly combined by the point convolution to obtain feature map one, which contains multi-level features.
In the embodiment of the invention, a 3×3 depth convolution with stride $s$ is used: $M$ convolution kernels of size $D_K \times D_K \times 1$ filter the channels of the high-dimensional image one by one, each with an output of size $D_f \times D_f \times 1$, wherein $D_K$ denotes the width and height of the convolution kernel and $D_f$ denotes the width and height of the filtered feature map; the outputs of the depth convolution are then linearly combined using $N$ point convolutions of size $1 \times 1 \times M$ and projected back to a low-dimensional feature, giving feature map one with a final output size of $D_f \times D_f \times N$. The computational cost of the depth convolution is $D_K \times D_K \times 1 \times M \times D_F \times D_F$, and the computational cost of the point convolution is $1 \times 1 \times M \times N \times D_F \times D_F$. By decomposing the standard convolution into a two-step process of filtering and combining, the depth separable convolution layer effectively reduces the amount of computation.
The fraction of the computation of a standard convolution that the depth separable convolution requires is expressed as:

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F}=\frac{1}{N}+\frac{1}{D_K^{2}}$$
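The following is a minimal sketch of this layer, assuming a PyTorch implementation; the channel counts and the BatchNorm/ReLU placement are illustrative and not specified by the patent.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth convolution (per-channel 3x3 filtering) followed by point
    convolution (1x1 linear combination of the filtered channels)."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # groups=in_ch makes each 3x3 kernel see exactly one input channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# For a 3x3 kernel the cost ratio above is 1/N + 1/9, i.e. roughly an
# 8-9x reduction for a typical N of 256 output channels.
```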
and step A03, processing the first feature map through a reverse residual error structure and a component separation attention module to obtain an attention weighted second feature map, namely an initial feature map of the unmanned aerial vehicle aerial image.
The invention combines the reverse residual structure and the component separation attention module to construct four basic convolution blocks with convolution strides of 1 and 2, so that the attention of the feature map is separated into two groups under the two convolution strides, retaining feature information well while keeping the parameter count low; a sketch of such a block follows.
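Below is a hedged sketch of an inverted residual block of this kind, assuming PyTorch; the attention submodule is passed in as a placeholder because the patent does not fully specify the internals of the component separation attention module, and the activation choices are illustrative.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """1x1 expansion -> 3x3 depthwise -> attention -> 1x1 projection,
    with a residual connection when stride is 1 and shapes match."""
    def __init__(self, in_ch, out_ch, stride, expand_t, attention: nn.Module):
        super().__init__()
        hidden = in_ch * expand_t            # channel expansion factor t
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),            # expand
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),               # depthwise
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            attention,                                          # attention weighting
            nn.Conv2d(hidden, out_ch, 1, bias=False),           # project back
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```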
In the embodiment of the present invention, parameters of the reverse residual structure are shown in table 1:
TABLE 1
And B, performing feature enhancement and feature refinement on the initial feature map by using the improved feature pyramid network to obtain a final feature map.
The invention builds on the PANet feature pyramid structure and proposes a context feature enhancement module and a feature pyramid refinement module. The context feature enhancement module extracts the context information present in the backbone network and injects the context information around an object into the feature pyramid network, yielding a feature map containing multi-scale information; the feature pyramid refinement module processes the multi-scale information to obtain features with stronger spatial character, preventing small targets from being submerged in conflicting information and improving the detection precision of small targets.
Step B01, feature map two is fused and injected into the feature pyramid from top to bottom, and after all levels of features in feature map two pass through the context feature enhancement module, feature map three containing rich context information is obtained.
In the embodiment of the invention, the context feature enhancement module uses three dilated convolutions with different dilation rates to extract the context information around targets in the unmanned aerial vehicle aerial image, injects the surrounding context information into the feature pyramid network, and finally fuses the information extracted by the context feature enhancement module through a Concat operation. Dilated convolution is a convolution variant that controls the spacing between the values the convolution kernel samples by introducing a "dilation rate"; dilated convolutions with different dilation rates can extract features from regions of different sizes around the target. For the smaller, high-level features in the feature extraction backbone network, the invention adopts a context feature enhancement module with dilation rates of 1, 2 and 4 to inject context information around the target into the feature pyramid network from top to bottom; for the larger, low-level features, a context feature enhancement module with dilation rates of 1, 3 and 5 injects context information into the feature pyramid network from bottom to top. A sketch of this module follows.
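The following is a minimal sketch of the dilated-convolution branches and Concat fusion described above, assuming PyTorch; the 1×1 fusion convolution and the choice to keep the channel count unchanged are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextFeatureAugmentation(nn.Module):
    """Parallel 3x3 dilated convolutions (e.g. rates 1, 2, 4 for high-level
    features) whose outputs are concatenated and fused back."""
    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            # padding=r keeps the spatial size constant for a 3x3 kernel.
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r, bias=False)
            for r in rates
        ])
        self.fuse = nn.Conv2d(channels * len(rates), channels, kernel_size=1)

    def forward(self, x):
        ctx = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(ctx)
```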
Step B02, feature map three is input into the feature pyramid refinement module, which performs feature refinement on it to obtain the final feature map of the unmanned aerial vehicle aerial image.
Because feature pyramid levels differ from one another semantically, the processing of the context feature enhancement module introduces redundant information and conflicting information while sharing information; the feature pyramid refinement module is therefore used to remove the redundant and conflicting information.
The feature pyramid refinement module introduces a feature refinement mechanism in the channel dimension and the spatial dimension and can process multi-scale information. It mainly comprises a channel purification module and a spatial purification module, which generate adaptive weights in the channel dimension and the spatial dimension respectively and guide the features to learn in a more critical direction. After the context feature enhancement module injects context information into the feature pyramid structure, the feature pyramid network performs upsampling and downsampling on the high-level and low-level features through top-down and bottom-up paths respectively, adjusting them to a uniform size, compressing them in the spatial dimension, and aggregating the global spatial information of the image.
Step B02-1, in the channel dimension, the channel purification module pools feature map three using adaptive average pooling and adaptive maximum pooling respectively to obtain two different spatial context features $F_{avg}^{c}$ and $F_{max}^{c}$, and generates the channel attention map $M_{C}\in\mathbb{R}^{C\times 1\times 1}$ from $F_{avg}^{c}$ and $F_{max}^{c}$ via a shared network, wherein $C$ is the number of channels of the feature map.
In the embodiment of the invention, the shared network is constructed from a multi-layer perceptron (MLP) with one hidden layer; to reduce parameters, the activation size of the hidden layer is $\mathbb{R}^{C/r\times 1\times 1}$, where $r$ is the compression ratio.
To simplify the computation, the invention uses the ReLU function as the activation function, and the computation of the channel purification module can be expressed as:

$$M_{c}(F)=\sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F))+\mathrm{MLP}(\mathrm{MaxPool}(F))\big)=\sigma\big(W_{1}(W_{0}(F_{avg}^{c}))+W_{1}(W_{0}(F_{max}^{c}))\big)$$

wherein $M_{c}(F)$ is the channel attention map output by the channel purification module, $\mathrm{AvgPool}(F)$ denotes adaptive average pooling, $\mathrm{MaxPool}(F)$ denotes adaptive maximum pooling, $F_{avg}^{c}$ is the adaptively average-pooled spatial context feature, $F_{max}^{c}$ is the adaptively max-pooled spatial context feature, $W_{0}$ is the parameter matrix applied to the pooled spatial context features and $W_{1}$ is the parameter matrix of the hidden layer of the multi-layer perceptron (MLP), with $W_{0}\in\mathbb{R}^{C/r\times C}$, $W_{1}\in\mathbb{R}^{C\times C/r}$, and $C$ the number of channels of the feature map.
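A minimal sketch of this channel purification computation, assuming PyTorch and a CBAM-style shared MLP; the compression ratio r=16 is an illustrative default, not a value given by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelPurification(nn.Module):
    """Shared MLP over adaptively average- and max-pooled descriptors,
    producing a (B, C, 1, 1) channel attention map M_c(F)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),   # W0: C -> C/r
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),   # W1: C/r -> C
        )

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = F.adaptive_avg_pool2d(x, 1).view(b, c)   # F_avg^c
        mx = F.adaptive_max_pool2d(x, 1).view(b, c)    # F_max^c
        attn = torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        return attn.view(b, c, 1, 1)
```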
Step B02-2, in the spatial dimension, the spatial purification module aggregates the channel information through pooling operations to generate two feature maps $F_{avg}^{s}$ and $F_{max}^{s}$, representing respectively the average-pooled feature and the max-pooled feature over the channels; a standard convolution is then applied to $F_{avg}^{s}$ and $F_{max}^{s}$ to generate the spatial attention map $M_{S}\in\mathbb{R}^{1\times H\times W}$, where $H$ and $W$ denote the height and width of the spatial feature map.

The computation of the spatial purification module can be expressed as:

$$M_{s}(F)=\sigma\big(f^{7\times 7}([F_{avg}^{s};F_{max}^{s}])\big)$$

wherein $M_{s}(F)$ is the spatial attention map output by the spatial purification module, $f^{7\times 7}$ denotes a convolution with a kernel size of $7\times 7$, $F_{avg}^{s}$ is the spatial feature corresponding to $F_{avg}^{c}$, and $F_{max}^{s}$ is the spatial feature corresponding to $F_{max}^{c}$.
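A matching sketch of the spatial purification computation, again assuming PyTorch; the sigmoid on the output is a conventional CBAM-style assumption.

```python
import torch
import torch.nn as nn

class SpatialPurification(nn.Module):
    """Channel-wise average and max maps, concatenated and passed through a
    7x7 convolution to produce a (B, 1, H, W) spatial attention map M_s(F)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):                           # x: (B, C, H, W)
        avg = torch.mean(x, dim=1, keepdim=True)    # F_avg^s
        mx, _ = torch.max(x, dim=1, keepdim=True)   # F_max^s
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
```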
Step B02-3, the channel attention map of the channel dimension and the spatial attention map of the spatial dimension are directly added and fused to obtain the refined final feature map, which can be expressed as $F\in\mathbb{R}^{C\times H\times W}$; the refined features have better expression capability and adaptability.
And C, constructing and training a small target detection model.
In order to solve the problem of imbalance between positive and negative samples in small target detection, the invention provides a single-stage anchor-free small target detection model, which is connected after the feature pyramid refinement module and, as shown in fig. 3, forms two branches through four convolution layers for the classification and regression tasks respectively.
In the embodiment of the invention, the feature map is mapped back to the original image, each pixel point is divided into positive and negative samples according to its coordinate position after being mapped back to the original image, and the target classification, bounding box regression and centrality losses of the small target detection model are then calculated. The training process for learning the small target detection model based on these losses is as follows:
(1) A data set is formed from a large number of unmanned aerial vehicle aerial images and manual annotation data (comprising manually annotated real target categories and real annotation frames). The unmanned aerial vehicle aerial images in the data set undergo feature extraction, feature enhancement and refinement through the operations of step A and step B, yielding the final feature map of each unmanned aerial vehicle aerial image. In the embodiment of the invention, the final feature map of each unmanned aerial vehicle aerial image contains multiple levels of features, so each final feature map can be split into multiple levels of sub-feature maps, and each pixel position in each level of sub-feature map is treated as a training sample, yielding the training sample set.
In the embodiment of the invention, since multiple targets may exist in each unmanned aerial vehicle aerial image, each image may have multiple real target categories and multiple real annotation frames. The real annotation frames of an unmanned aerial vehicle aerial image are defined as a sequence $\{B_j\}$, wherein $B_j=\left(x_0^{(j)},y_0^{(j)},x_1^{(j)},y_1^{(j)},c_j\right)$ is the $j$-th real annotation frame in the image, $\left(x_0^{(j)},y_0^{(j)}\right)$ and $\left(x_1^{(j)},y_1^{(j)}\right)$ denote the coordinates of its upper-left and lower-right corners, and $c_j$ is the category of the target in the $j$-th real annotation frame.
(2) Let the $i$-th level sub-feature map in the final feature map be $F_i\in\mathbb{R}^{H\times W\times C}$. Pixel $(x,y)$ in sub-feature map $F_i$ is mapped back to the original unmanned aerial vehicle aerial image to obtain the mapped coordinate position $\left(\left\lfloor\frac{s}{2}\right\rfloor+xs,\ \left\lfloor\frac{s}{2}\right\rfloor+ys\right)$, wherein $s$ is the total stride from the feature extraction backbone network to the $i$-th level sub-feature map; this helps prevent gradient explosion during training. The mapping must ensure that the mapped coordinate position lies near the center of the receptive field of point $(x,y)$, so as to preserve more of the spatial information of the original image.
(3) If the mapped coordinate position of pixel $(x,y)$ in sub-feature map $F_i$ falls within a real annotation frame $B_j$ of the unmanned aerial vehicle aerial image, pixel $(x,y)$ is regarded as a positive sample and its real target class label $c^{*}$ is set to the category $c_j$ of the target in the real annotation frame $B_j$; otherwise, pixel $(x,y)$ is regarded as a negative sample, i.e. background, and its real target class label is set to $c^{*}=0$.
(4) And (3) repeating the steps (2) to (3) to obtain the real target class labels of all the pixel points in the sub-feature map in the training sample set.
(5) In addition to the classification label, the invention constructs a four-dimensional real vector $v^{*}=(l^{*},t^{*},r^{*},b^{*})$ from the real annotation frame as the frame regression target of pixel $(x,y)$ in sub-feature map $F_i$, wherein $l^{*}$, $t^{*}$, $r^{*}$ and $b^{*}$ are the perpendicular distances from pixel $(x,y)$ to the left, upper, right and lower boundaries of the real annotation frame respectively.

If pixel $(x,y)$ lies within the real annotation frame $B_j$, then:

$$l^{*}=x-x_0^{(j)},\quad t^{*}=y-y_0^{(j)},\quad r^{*}=x_1^{(j)}-x,\quad b^{*}=y_1^{(j)}-y$$
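A small worked sketch of this label assignment in plain Python; the helper name and the inclusive boundary test are illustrative assumptions.

```python
def regression_targets(x, y, stride, box):
    """Map feature-map location (x, y) back to the image and, if it falls
    inside the ground-truth box (x0, y0, x1, y1), return (l*, t*, r*, b*)."""
    x0, y0, x1, y1 = box
    px = stride // 2 + x * stride   # mapped image coordinates, near the
    py = stride // 2 + y * stride   # centre of the receptive field
    if not (x0 <= px <= x1 and y0 <= py <= y1):
        return None                 # negative sample: background, c* = 0
    return (px - x0, py - y0, x1 - px, y1 - py)

# Example: stride 8, box (0, 0, 64, 64), location (4, 4) maps to (36, 36)
# and yields targets (36, 36, 28, 28).
```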
(6) The final feature map of the unmanned aerial vehicle aerial image is input into the small target detection model, and the last prediction layer outputs a multidimensional vector $p$ for classification and a four-dimensional vector $t$ carrying frame coordinate encoding information, wherein the multidimensional vector $p$ contains the probabilities that the target belongs to each category and the four-dimensional vector $t$ contains the distances from each pixel in the unmanned aerial vehicle aerial image to the target frame.
Unlike a conventional detection network that trains one multi-class classifier, the invention trains $C$ binary classifiers, achieving better classification accuracy.
(7) In order to train the small target detection model, the invention establishes the joint loss function of target classification and frame regression of the small target detection model from the real target class labels obtained in step (4), the frame regression targets obtained in step (5) and the prediction results of the small target detection model obtained in step (6), summing over all positions on sub-feature map $F_i$. The loss function is expressed as follows:

$$L(\{p_{x,y}\},\{t_{x,y}\})=\frac{1}{N_{pos}}\sum_{x,y}L_{cls}(p_{x,y},c_{x,y}^{*})+\frac{\lambda_{1}}{N_{pos}}\sum_{x,y}\mathbb{1}_{\{c_{x,y}^{*}>0\}}L_{reg}(t_{x,y},t_{x,y}^{*})$$

wherein $L(\{p_{x,y}\},\{t_{x,y}\})$ denotes the joint loss function of target classification and frame regression; $\{p_{x,y}\}$ denotes the set of prediction classification results of the small target detection model, each pixel position $(x,y)$ corresponding to a ten-dimensional vector $p_{x,y}$ that denotes the prediction probabilities that pixel $(x,y)$ belongs to the different categories; $\{t_{x,y}\}$ denotes the set of prediction frame regression results of the small target detection model, each pixel position $(x,y)$ corresponding to a four-dimensional vector $t_{x,y}$ that denotes the frame regression prediction value of pixel $(x,y)$; $c_{x,y}^{*}$ denotes the real target class label of pixel $(x,y)$, used to calculate the classification term of the loss function; $t_{x,y}^{*}$ denotes the frame regression target of pixel $(x,y)$; $L_{cls}$ is the focal loss; $L_{reg}$ is the generalized intersection-over-union loss; $N_{pos}$ denotes the number of positive samples; $\lambda_{1}$ is the balance weight of $L_{reg}$; and $\mathbb{1}_{\{c_{x,y}^{*}>0\}}$ is the indicator function, equal to 1 when $c_{x,y}^{*}>0$ and 0 otherwise. In the invention, the value of $\lambda_{1}$ is 1.
In the detection process, the large number of frames generated far from the target center point obviously degrades the target detection effect; to reduce the number of low-quality frames, the invention adds a single-layer branch parallel to the frame regression branch to predict the centrality of frames.
The centrality loss describes the normalized distance between a point and the target center it is responsible for. Given the regression targets $l^{*}$, $t^{*}$, $r^{*}$ and $b^{*}$ of a point from step (5), the true centrality is calculated as:

$$\mathrm{center}^{*}=\sqrt{\frac{\min(l^{*},r^{*})}{\max(l^{*},r^{*})}\times\frac{\min(t^{*},b^{*})}{\max(t^{*},b^{*})}}$$

where the square root is used to mitigate the rate of decrease of the centrality value.
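A one-function sketch of this true-centrality computation in plain Python, with an illustrative check of the boundary behaviour.

```python
import math

def centerness(l, t, r, b):
    """True centrality of a positive sample with regression targets
    (l*, t*, r*, b*); the square root softens the decay from the centre."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

# A pixel exactly at the box centre scores 1.0; near an edge the score
# approaches 0, so its frame is suppressed at inference time.
assert centerness(10, 10, 10, 10) == 1.0
assert centerness(1, 10, 19, 10) < 0.25
```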
The centrality value ranges over $[0,1]$ and is trained with the binary cross entropy loss (Binary Cross Entropy Loss) to suppress the confidence of frames far from the target center; finally, non-maximum suppression (NMS) screens out most unqualified frames, improving the quality of the target frames, greatly improving the detection performance of the network, and outputting the final target position information.
The invention combines the target classification, frame regression and centrality losses to obtain the final overall loss function:

$$L(\{p_{x,y}\},\{t_{x,y}\},\{c_{x,y}\})=L(\{p_{x,y}\},\{t_{x,y}\})+\frac{\lambda_{2}}{N_{pos}}\sum_{x,y}\mathbb{1}_{\{c_{x,y}^{*}>0\}}L_{cen}(\mathrm{center}_{x,y},\mathrm{center}_{x,y}^{*})$$

wherein $L(\{p_{x,y}\},\{t_{x,y}\},\{c_{x,y}\})$ denotes the overall loss function combining target classification, frame regression and centrality losses, $\mathrm{center}_{x,y}$ and $\mathrm{center}_{x,y}^{*}$ denote the predicted centrality value and the true centrality value at pixel $(x,y)$ respectively, and $\lambda_{2}$ is the balance weight of the centrality loss function $L_{cen}$; in the invention, the value of $\lambda_{2}$ is 1.
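A hedged sketch of how the three terms combine, assuming PyTorch; `cls_loss` and `reg_loss` stand in for the already-normalized focal and generalized IoU terms, whose implementations are not shown, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(cls_loss, reg_loss, center_pred, center_target, pos_mask,
               lambda2: float = 1.0):
    """Add the centrality term: binary cross entropy over positive samples,
    normalized by N_pos, weighted by lambda2."""
    n_pos = pos_mask.sum().clamp(min=1)
    cen_loss = F.binary_cross_entropy_with_logits(
        center_pred[pos_mask], center_target[pos_mask], reduction="sum")
    return cls_loss + reg_loss + lambda2 * cen_loss / n_pos
```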
(8) The loss value of the current iteration of the small target detection model is calculated through the overall loss function above, and the parameters of the small target detection model are updated according to this loss value so that the model fully learns the patterns and features in the large amount of manual annotation data; iteration is repeated until the loss converges, yielding the trained small target detection model.
And D, detecting the final feature map output in the step B by using the small target detection model trained in the step C to obtain a small target detection result of the aerial image of the unmanned aerial vehicle, wherein the small target detection result comprises a target class and a target frame.
Example 2
The embodiment introduces a small target detection device for aerial images of unmanned aerial vehicles based on the same inventive concept as that of embodiment 1, and the small target detection device comprises a feature extraction module, a feature enhancement refinement module and a small target detection module.
The feature extraction module is used for performing feature extraction on the real-time acquired unmanned aerial vehicle aerial image by using the feature extraction backbone network to obtain an initial feature map of the unmanned aerial vehicle aerial image.
The feature enhancement refinement module is used for carrying out feature enhancement and feature refinement on the initial feature map by utilizing the improved feature pyramid network to obtain a final feature map.
The small target detection module is used for detecting the final feature map by using the trained small target detection model to obtain a small target detection result of the unmanned aerial vehicle aerial image.
The specific function implementation of each module is described in the method of embodiment 1 and is not repeated here; it is specifically noted that:
in the feature extraction module, a feature extraction backbone network comprises a depth separable convolution layer, a reverse residual structure and a component separation attention module which are connected in sequence.
In the feature enhancement refinement module, the improved feature pyramid network includes a contextual feature enhancement module and a feature pyramid refinement module.
In summary, the embodiment of the invention improves the existing convolutional neural network. In the feature extraction stage, network parameters are reduced through the depth separable convolution and reverse residual structure, and the component separation attention module assists feature extraction, effectively strengthening the feature extraction capability of the backbone network. After feature extraction, the invention introduces a context feature enhancement module and a feature pyramid refinement module, which add context information to the initial features to realize feature enhancement and then perform feature refinement; this fully mines the dependency relationships between objects and between objects and the background, retains feature information well while keeping the parameter count low, effectively improves the network's feature expression capability for small targets, and benefits the subsequent detection of small targets. The invention constructs the small target detection model with a single-stage anchor-free design, which balances positive and negative samples in the data set while greatly reducing complex hyperparameter settings and computation, accurately detects small targets in unmanned aerial vehicle aerial images from the feature map, and improves the detection precision of small target detection.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive; many forms may be made by those of ordinary skill in the art without departing from the spirit of the present invention and the scope of the claims, all of which fall within the protection of the present invention.

Claims (9)

1. The small target detection method for the aerial image of the unmanned aerial vehicle is characterized by comprising the following steps of:
performing feature extraction on the real-time acquired aerial image of the unmanned aerial vehicle by using a feature extraction backbone network to obtain an initial feature map of the aerial image of the unmanned aerial vehicle;
performing feature enhancement and feature refinement on the initial feature map by using an improved feature pyramid network to obtain a final feature map;
detecting the final feature map by using a trained small target detection model to obtain a small target detection result of the unmanned aerial vehicle aerial image;
the improved feature pyramid network comprises a context feature enhancement module and a feature pyramid refinement module.
2. The method for detecting a small target of an aerial image of an unmanned aerial vehicle according to claim 1, wherein the feature extraction of the aerial image of the unmanned aerial vehicle acquired in real time by using the feature extraction backbone network, to obtain an initial feature map of the aerial image of the unmanned aerial vehicle, comprises:
performing depth convolution and point convolution on the unmanned aerial vehicle aerial image by using the depth separable convolution layer to obtain feature map one;
and performing attention weighting processing on feature map one by using the reverse residual structure and the component separation attention module to obtain an attention-weighted feature map two, namely the initial feature map of the unmanned aerial vehicle aerial image.
3. The method for detecting a small target of an aerial image of an unmanned aerial vehicle according to claim 1, wherein the performing feature enhancement and feature refinement on the initial feature map by using the improved feature pyramid network to obtain a final feature map comprises:
extracting context information around the initial feature map by using the context feature enhancement module to obtain feature map three containing the context information;
and performing feature refinement on feature map three in the channel dimension and the spatial dimension by using the feature pyramid refinement module to obtain the final feature map of the unmanned aerial vehicle aerial image.
4. A small target detection method for aerial images of an unmanned aerial vehicle according to claim 3, wherein the feature pyramid refinement module is used for performing feature refinement on the feature map three in a channel dimension and a space dimension respectively, and the method comprises the following steps:
under the channel dimension, performing adaptive average pooling and adaptive maximum pooling on feature map three by using a channel purification module to obtain the spatial context features $F_{avg}^{c}$ and $F_{max}^{c}$, from which the channel attention map is generated; the expression of the channel purification module is as follows:

$$M_{c}(F)=\sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F))+\mathrm{MLP}(\mathrm{MaxPool}(F))\big)=\sigma\big(W_{1}(W_{0}(F_{avg}^{c}))+W_{1}(W_{0}(F_{max}^{c}))\big)$$

wherein $M_{c}(F)$ is the channel attention map output by the channel purification module, $\mathrm{AvgPool}(F)$ denotes adaptive average pooling, $\mathrm{MaxPool}(F)$ denotes adaptive maximum pooling, $F_{avg}^{c}$ is the adaptively average-pooled spatial context feature, $F_{max}^{c}$ is the adaptively max-pooled spatial context feature, $W_{0}$ is the parameter matrix applied to the adaptively average-pooled and adaptively max-pooled spatial context features, and $W_{1}$ is the parameter matrix of the hidden layer of the multi-layer perceptron;

in the spatial dimension, generating through softmax the relative weights of each position with respect to the channels in the channel attention map, using the spatial purification module to process the spatial context features $F_{avg}^{c}$ and $F_{max}^{c}$ based on these relative weights to obtain the corresponding spatial features $F_{avg}^{s}$ and $F_{max}^{s}$, and applying a standard convolution to $F_{avg}^{s}$ and $F_{max}^{s}$ to obtain the spatial attention map; the expression of the spatial purification module is as follows:

$$M_{s}(F)=\sigma\big(f^{7\times 7}([F_{avg}^{s};F_{max}^{s}])\big)$$

wherein $M_{s}(F)$ is the spatial attention map output by the spatial purification module, $f^{7\times 7}$ denotes a convolution with a kernel size of $7\times 7$, $F_{avg}^{s}$ is the spatial feature corresponding to $F_{avg}^{c}$, and $F_{max}^{s}$ is the spatial feature corresponding to $F_{max}^{c}$;

and fusing the channel attention map of the channel dimension with the spatial attention map of the spatial dimension to obtain the final feature map of the unmanned aerial vehicle aerial image.
5. The method for detecting a small target of an aerial image of an unmanned aerial vehicle according to claim 1, wherein in the training process of the small target detection model, the overall loss of the small target detection model is calculated from the target classification loss, the frame regression loss and the centrality loss;

the expression of the joint loss function of target classification and frame regression of the small target detection model is as follows:

$$L(\{p_{x,y}\},\{t_{x,y}\})=\frac{1}{N_{pos}}\sum_{x,y}L_{cls}(p_{x,y},c_{x,y}^{*})+\frac{\lambda_{1}}{N_{pos}}\sum_{x,y}\mathbb{1}_{\{c_{x,y}^{*}>0\}}L_{reg}(t_{x,y},t_{x,y}^{*})$$

wherein $L(\{p_{x,y}\},\{t_{x,y}\})$ denotes the joint loss function of target classification and frame regression; $\{p_{x,y}\}$ denotes the set of prediction classification results of the small target detection model, each pixel position $(x,y)$ in the final feature map of the unmanned aerial vehicle aerial image corresponding to a multidimensional vector $p_{x,y}$, which denotes the prediction probabilities that pixel $(x,y)$ belongs to the different categories; $\{t_{x,y}\}$ denotes the set of prediction frame regression results of the small target detection model, each pixel position $(x,y)$ corresponding to a four-dimensional vector $t_{x,y}$, which denotes the frame regression prediction value of pixel $(x,y)$; $c_{x,y}^{*}$ denotes the real target class label of pixel $(x,y)$; $t_{x,y}^{*}$ denotes the frame regression target of pixel $(x,y)$; $L_{cls}$ is the focal loss; $L_{reg}$ is the generalized intersection-over-union loss; $N_{pos}$ denotes the number of positive samples; $\lambda_{1}$ is the balance weight of $L_{reg}$; and $\mathbb{1}_{\{c_{x,y}^{*}>0\}}$ is the indicator function, equal to 1 when $c_{x,y}^{*}>0$ and 0 otherwise;

in the detection process, the large number of frames generated far from the target center point obviously degrades the target detection effect, so to reduce the number of low-quality frames a single-layer branch parallel to the frame regression branch is provided to predict the centrality of frames;

the centrality loss describes the normalized distance between a pixel in the final feature map and the target center it is responsible for;

combining the target classification loss, the frame regression loss and the centrality loss yields the overall loss function of the small target detection model:

$$L(\{p_{x,y}\},\{t_{x,y}\},\{c_{x,y}\})=L(\{p_{x,y}\},\{t_{x,y}\})+\frac{\lambda_{2}}{N_{pos}}\sum_{x,y}\mathbb{1}_{\{c_{x,y}^{*}>0\}}L_{cen}(\mathrm{center}_{x,y},\mathrm{center}_{x,y}^{*})$$

wherein $L(\{p_{x,y}\},\{t_{x,y}\},\{c_{x,y}\})$ denotes the overall loss function, $\mathrm{center}_{x,y}$ and $\mathrm{center}_{x,y}^{*}$ denote the predicted centrality value and the true centrality value at pixel $(x,y)$ respectively, and $\lambda_{2}$ is the balance weight of the centrality loss function $L_{cen}$.
6. The utility model provides a little target detection device of unmanned aerial vehicle aerial image which characterized in that includes:
the feature extraction module is used for performing feature extraction on the real-time acquired unmanned aerial vehicle aerial image by using a feature extraction backbone network to obtain an initial feature map of the unmanned aerial vehicle aerial image;
The feature enhancement refinement module is used for carrying out feature enhancement and feature refinement on the initial feature map by utilizing an improved feature pyramid network to obtain a final feature map;
the small target detection module is used for detecting the final feature map by using a trained small target detection model to obtain a small target detection result of the unmanned aerial vehicle aerial image;
in the feature extraction module, the feature extraction backbone network comprises a depth separable convolution layer, a reverse residual structure and a component separation attention module which are connected in sequence;
in the feature enhancement refinement module, the improved feature pyramid network includes a contextual feature enhancement module and a feature pyramid refinement module.
7. The small target detection device for aerial images of an unmanned aerial vehicle according to claim 6, wherein the specific operation of the feature enhancement refinement module is as follows:
extracting context information around the initial feature map by using the context feature enhancement module to obtain a feature map III containing the context information;
and carrying out feature refinement on the feature map three in the channel dimension and the space dimension by utilizing the feature pyramid refinement module to obtain a final feature map of the unmanned aerial vehicle aerial image.
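This claim leaves the internals of the context feature enhancement module open. One common way to gather context around a feature map is a set of parallel dilated convolutions whose outputs are fused back onto the input; the sketch below is an assumption along those lines, not the patented module itself.

```python
import torch
import torch.nn as nn

class ContextFeatureEnhancement(nn.Module):
    """Hypothetical context extractor: parallel 3x3 convs with growing
    dilation rates cover progressively larger neighbourhoods; their
    outputs are concatenated and fused back onto the input feature map."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False)
            for d in dilations)
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        ctx = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.fuse(ctx)   # feature map three: input plus context
```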
8. The small target detection device for aerial images of an unmanned aerial vehicle according to claim 7, wherein the feature pyramid refinement module performs feature refinement on the feature map three in the channel dimension and the spatial dimension respectively, comprising:
in the channel dimension, the channel purification module performs adaptive average pooling and adaptive maximum pooling on the feature map three to obtain the spatial context features F^c_avg and F^c_max, from which the channel attention map is computed; the expression of the channel purification module is as follows:

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F^c_avg)) + W_1(W_0(F^c_max)))

wherein M_c(F) is the channel attention map output by the channel purification module, AvgPool(F) denotes adaptive average pooling, MaxPool(F) denotes adaptive maximum pooling, F^c_avg is the adaptively average-pooled spatial context feature, F^c_max is the adaptively max-pooled spatial context feature, W_0 is the parameter matrix applied to both pooled spatial context features, W_1 is the hidden-layer parameter matrix of the multi-layer perceptron, and σ is the sigmoid function;
in the spatial dimension, softmax is applied to the channel attention map to generate the relative weight of each position with respect to the channels; based on these relative weights, the spatial purification module processes the spatial context features F^c_avg and F^c_max respectively to obtain the corresponding spatial features F^s_avg and F^s_max, and a standard convolution is applied to the concatenation of F^s_avg and F^s_max to obtain the spatial attention map; the expression of the spatial purification module is as follows:

M_s(F) = σ(f^{7×7}([F^s_avg; F^s_max]))

wherein M_s(F) is the spatial attention map output by the spatial purification module, f^{7×7} denotes a convolution with a kernel size of 7×7, F^s_avg is the spatial feature corresponding to F^c_avg, F^s_max is the spatial feature corresponding to F^c_max, and [· ; ·] denotes channel-wise concatenation;
and fusing the channel attention map from the channel dimension with the spatial attention map from the spatial dimension to obtain the final feature map of the unmanned aerial vehicle aerial image (a minimal sketch of both purification steps follows this claim).
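A minimal PyTorch sketch of the two purification steps, following the expressions reconstructed above; the reduction ratio, the sigmoid nonlinearity producing the attention maps, and the channel-then-spatial fusion order are assumptions, in the style of CBAM-type attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelPurification(nn.Module):
    """M_c(F) = sigmoid(W1(W0(F_avg)) + W1(W0(F_max))): a shared two-layer
    MLP (1x1 convs) over adaptively average- and max-pooled features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                # W0 then W1, shared weights
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, f):
        avg = self.mlp(F.adaptive_avg_pool2d(f, 1))   # F_avg branch
        mx = self.mlp(F.adaptive_max_pool2d(f, 1))    # F_max branch
        return torch.sigmoid(avg + mx)                # (B, C, 1, 1) map

class SpatialPurification(nn.Module):
    """M_s(F) = sigmoid(f7x7([F_avg; F_max])): a 7x7 conv over per-position
    average and maximum taken across channels, concatenated."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, f):
        avg = f.mean(dim=1, keepdim=True)             # spatial avg feature
        mx, _ = f.max(dim=1, keepdim=True)            # spatial max feature
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

# usage sketch: channel-then-spatial refinement of "feature map three"
cp, sp = ChannelPurification(256), SpatialPurification()
f3 = torch.randn(1, 256, 64, 64)
refined = f3 * cp(f3)            # apply channel attention map
final = refined * sp(refined)    # apply spatial attention map -> final map
```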
9. The small target detection device for aerial images of an unmanned aerial vehicle according to claim 6, wherein, in the small target detection module, the overall loss of the small target detection model is calculated from the target classification loss, the bounding box regression loss and the centrality loss;

the expression of the joint loss function of object classification and bounding box regression of the small target detection model is as follows:
L({p_{x,y}}, {t_{x,y}}) = (1/N_pos) Σ_{x,y} L_cls(p_{x,y}, c*_{x,y}) + (λ_1/N_pos) Σ_{x,y} 1{c*_{x,y} > 0} · L_reg(t_{x,y}, t*_{x,y})

wherein L({p_{x,y}}, {t_{x,y}}) denotes the joint loss function of object classification and bounding box regression; {p_{x,y}} denotes the set of predicted classification results of the small target detection model, each pixel position (x, y) in the final feature map of the unmanned aerial vehicle aerial image corresponding to a multidimensional vector p_{x,y}, the predicted probabilities that pixel (x, y) belongs to the different categories; {t_{x,y}} denotes the set of predicted bounding box regression results, each pixel position (x, y) corresponding to a four-dimensional vector t_{x,y}, the bounding box regression prediction for pixel (x, y); c*_{x,y} denotes the real object class label of pixel (x, y) and t*_{x,y} denotes its bounding box regression target; L_cls is the focal loss, L_reg is the generalized IoU loss, N_pos is the number of positive samples, λ_1 is the balance weight of L_reg, and 1{c*_{x,y} > 0} is the indicator function, equal to 1 when c*_{x,y} > 0 and 0 otherwise;
in the detection process, the large number of bounding boxes generated far from the center point of a target noticeably degrades detection performance; to suppress these low-quality boxes, a single-layer branch, parallel to the bounding box regression branch, predicts the centrality of each box;

the centrality loss describes the normalized distance between a pixel point in the final feature map and the center of the target that the pixel point is responsible for;
combining the target classification loss, the bounding box regression loss and the centrality loss gives the overall loss function of the small target detection model:

L({p_{x,y}}, {t_{x,y}}, {c_{x,y}}) = L({p_{x,y}}, {t_{x,y}}) + (λ_2/N_pos) Σ_{x,y} 1{c*_{x,y} > 0} · L_cen(center_{x,y}, center*_{x,y})

wherein L({p_{x,y}}, {t_{x,y}}, {c_{x,y}}) denotes the overall loss function, center_{x,y} and center*_{x,y} denote the predicted centrality value and the true centrality value at pixel point (x, y) respectively, and λ_2 is the balance weight of the centrality loss function L_cen.
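The claims do not state how the true centrality value center*_{x,y} is obtained. In FCOS-style detectors it is commonly derived from the regression target (l*, t*, r*, b*), the distances from pixel (x, y) to the four sides of its assigned ground-truth box, so that it decays toward 0 far from the object center; the sketch below is written under that assumption.

```python
import torch

def centerness_target(ltrb: torch.Tensor) -> torch.Tensor:
    """center* = sqrt( (min(l,r)/max(l,r)) * (min(t,b)/max(t,b)) ).

    ltrb: (N, 4) distances (l*, t*, r*, b*) from each positive pixel to the
    left/top/right/bottom sides of its assigned ground-truth box.
    Returns values in (0, 1]; equals 1 exactly at the box center.
    """
    l, t, r, b = ltrb.unbind(dim=1)
    lr = torch.minimum(l, r) / torch.maximum(l, r).clamp(min=1e-6)
    tb = torch.minimum(t, b) / torch.maximum(t, b).clamp(min=1e-6)
    return torch.sqrt(lr * tb)
```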
CN202311725252.6A 2023-12-14 2023-12-14 Small target detection method and device for aerial image of unmanned aerial vehicle Pending CN117710841A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311725252.6A CN117710841A (en) 2023-12-14 2023-12-14 Small target detection method and device for aerial image of unmanned aerial vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311725252.6A CN117710841A (en) 2023-12-14 2023-12-14 Small target detection method and device for aerial image of unmanned aerial vehicle

Publications (1)

Publication Number Publication Date
CN117710841A 2024-03-15

Family

ID=90143863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311725252.6A Pending CN117710841A (en) 2023-12-14 2023-12-14 Small target detection method and device for aerial image of unmanned aerial vehicle

Country Status (1)

Country Link
CN (1) CN117710841A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117934820A (en) * 2024-03-22 2024-04-26 中国人民解放军海军航空大学 Infrared target identification method based on difficult sample enhancement loss


Similar Documents

Publication Publication Date Title
CN110619369B (en) Fine-grained image classification method based on feature pyramid and global average pooling
CN108875674B (en) Driver behavior identification method based on multi-column fusion convolutional neural network
CN110163187B (en) F-RCNN-based remote traffic sign detection and identification method
CN110428428B (en) Image semantic segmentation method, electronic equipment and readable storage medium
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN114202672A (en) Small target detection method based on attention mechanism
CN111191583B (en) Space target recognition system and method based on convolutional neural network
CN110929577A (en) Improved target identification method based on YOLOv3 lightweight framework
CN112380921A (en) Road detection method based on Internet of vehicles
CN111696110B (en) Scene segmentation method and system
CN113158862A (en) Lightweight real-time face detection method based on multiple tasks
CN112784756B (en) Human body identification tracking method
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN111461213A (en) Training method of target detection model and target rapid detection method
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN117710841A (en) Small target detection method and device for aerial image of unmanned aerial vehicle
CN116385958A (en) Edge intelligent detection method for power grid inspection and monitoring
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN112084897A (en) Rapid traffic large-scene vehicle target detection method of GS-SSD
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S
CN114492634B (en) Fine granularity equipment picture classification and identification method and system
CN112597919A (en) Real-time medicine box detection method based on YOLOv3 pruning network and embedded development board
CN115222998A (en) Image classification method
CN110633706B (en) Semantic segmentation method based on pyramid network
CN113963333B (en) Traffic sign board detection method based on improved YOLOF model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination