CN116958687A - Unmanned aerial vehicle-oriented small target detection method and device based on improved DETR - Google Patents

Unmanned aerial vehicle-oriented small target detection method and device based on improved DETR

Info

Publication number
CN116958687A
Authority
CN
China
Prior art keywords
detr
improved
small target
target detection
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310931094.3A
Other languages
Chinese (zh)
Inventor
杜强
姜明新
洪远
王杰
项靖
黄俊闻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN202310931094.3A
Publication of CN116958687A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Abstract

The invention discloses a small target detection method and device for unmanned aerial vehicles based on an improved DETR. A data set for UAV small-target detection is constructed and divided into a training set and a test set; a small target detection network based on the improved DETR is constructed: ShuffleNet-d is adopted as the feature extraction network of the DETR, and a 1×1 convolution module is introduced to extract features along the channel dimension, where ShuffleNet-d is ShuffleNetV2 with the original global pooling and fully connected layers deleted; the self-attention in the DETR encoder is replaced with FlashAttention-2; the Neck layer of the DETR adopts the deformable cross-scale feature fusion module Deformable-CCFM; Smooth-L1 Loss and DIoU Loss are adopted as the loss function of the improved, FlashAttention-2-based lightweight DETR feature extraction network; and the data set is used to train and evaluate the small target detection network based on the improved DETR. Aiming at the difficulty of detecting small targets in unmanned aerial vehicle scenes, the invention redesigns the network structure, fuses multi-scale and multi-level information, improves the representation capability of the network, and improves the detection precision of small targets.

Description

Unmanned aerial vehicle-oriented small target detection method and device based on improved DETR
Technical Field
The invention belongs to the application of deep learning in the field of computer vision, and particularly relates to an unmanned aerial vehicle-oriented small target detection method and device based on improved DETR.
Background
Unmanned aerial vehicle target recognition is becoming more and more popular, and unmanned aerial vehicles have gradually become an indispensable part of complex scenes; UAV aerial data somewhat resemble remote sensing image data, and small targets abound in aerial images. Unmanned aerial vehicles are typically deployed in large scenes, which means that the various objects of interest in an image, such as pedestrians, bicycles and automobiles, are small in scale because of the high shooting altitude and are easily disturbed by the environment, making them difficult to detect with conventional object detection methods. Traditional UAV aerial-image target detection methods suffer from a high miss rate, a low detection success rate, and a large model size. Improving an algorithm's ability to detect small targets in UAV aerial images has therefore become a challenging research direction in the field of target detection; at the same time, considering deployment on small devices in UAV scenes, the model must be quantized, the computation reduced, and memory utilization improved.
DETR is the first end-to-end Transformer-based algorithm, with no anchor pre-processing and no NMS post-processing, but it converges slowly and is slow to train and to infer; although subsequent optimized algorithms have steadily accelerated convergence and improved inference speed, real-time requirements still cannot be met. DETR also performs poorly on small object detection: existing detectors typically rely on multi-scale features, and small targets are usually detected on high-resolution feature maps, whereas DETR does not detect with multi-scale features, mainly because high-resolution feature maps would add unacceptable computational complexity to it. In the prior art, DETR needs a long training time to converge, and DETR-series variants increase model depth and stack parameters in pursuit of detection precision, yielding complex structures with large parameter counts that are unsuited to mid- and low-end devices and ignore practical application conditions. Moreover, the time and memory complexity of the Transformer's core self-attention module is quadratic in sequence length; approximate attention methods that reduce compute and memory requirements have been proposed, but they focus too heavily on reducing floating-point operations and tend to ignore the overhead of memory access.
Therefore, aiming at the difficulty the existing DETR target detection algorithm has in detecting small targets in unmanned aerial vehicle scenes, and at the relatively large size of aerial-image detection models deployed on unmanned aerial vehicles, a small target detection method oriented to UAV scenes is needed, so that an aerial-image detection system deployed on a UAV can perform the target detection task in aerial images quickly and accurately under hardware resource constraints.
Disclosure of Invention
The invention aims to: the invention provides an unmanned aerial vehicle-oriented small target detection method and device based on an improved DETR, which fuse multi-scale and multi-level information and improve the characterization capability of the network so as to improve small-target detection precision.
The technical scheme is as follows: the invention provides an unmanned aerial vehicle-oriented small target detection method based on improved DETR, which specifically comprises the following steps:
(1) Constructing a data set aiming at unmanned aerial vehicle small target detection, and dividing the data set into a training set and a testing set;
(2) Constructing a small target detection network based on the improved DETR: ShuffleNet-d is adopted as the feature extraction network of the DETR, and a 1×1 convolution module is introduced to extract features along the channel dimension, wherein ShuffleNet-d is ShuffleNetV2 with the original global pooling and fully connected layers deleted; the self-attention in the DETR encoder is replaced with FlashAttention-2; and the Neck layer of the DETR adopts the deformable cross-scale feature fusion module Deformable-CCFM;
(3) Adopting Smooth-L1 Loss and DIoU Loss as the loss function of the improved, FlashAttention-2-based lightweight DETR feature extraction network;
(4) Training a small target detection network based on improved DETR by using a training set;
(5) Inputting the test set into the trained small target detection network based on the improved DETR, and evaluating the network to realize unmanned aerial vehicle-oriented small target detection.
Further, the implementation process of ShuffleNet-d in step (2) is as follows:
firstly, the initial image passes through a 3×3 convolution layer with stride 2, which uses filters to extract image features and brings the image to 1/2 of its original size; a max-pooling operation is performed on the resulting feature map with 2×2 pooling kernels, taking the maximum value in each 2×2 region, so that the spatial dimensions of the feature map are halved and the image becomes 1/4 of its original size; a stage module consists of ShuffleNetV2 unit 1 and ShuffleNetV2 unit 2, and the repetition counts of unit 1 and unit 2 differ between stage modules; the first block of each stage consists of ShuffleNetV2 unit 1 with stride 2, completing the downsampling and doubling the output channels; in the stage2 module, unit 1 and unit 2 are repeated 1 and 3 times respectively, and the output image becomes 1/8 of the original size; in the stage3 module, unit 1 and unit 2 are repeated 1 and 7 times, and the output image becomes 1/16 of the original size; in the stage4 module, unit 1 and unit 2 are repeated 1 and 3 times, and the output image becomes 1/32 of the original size; the outputs of stage2, stage3 and stage4 are then used as multi-scale features, with their channel counts unified by 1×1 conv, as the input of the multi-scale feature fusion module: the stage2 output passes through a 1×1 conv to give S3, the stage3 output gives S4, and the stage4 output gives S5.
Further, the deformable cross-scale feature Fusion module Deformable-CCFM of step (2) completes feature fusion through a Fusion module: F5 is taken as F_high and S4 as F_low; F_high is first upsampled to a feature map of the same size as F_low and concatenated with F_low along the channel dimension; a 1×1 convolution then reduces the number of channels to the previous dimension; the output is divided into two parts, one performing feature interaction through n repeated RepVGG-blocks, the other being a residual edge connected directly to the output; finally the two parts are added element by element. The output fused by the first Fusion module is then taken as F_high and S3 as F_low, and the second Fusion module is completed in the same way, outputting the final fused feature map.
Further, the RepVGG-block has three branches: a main branch with a 3×3 convolution kernel, a branch with a 1×1 convolution kernel, and a branch with only BN; the three branches are added element by element, and finally a PReLU activation function is applied.
Further, the loss function used for training in step (4) is given by the following formula:
$\mathcal{L}_{reg}=\mathcal{L}_{SmoothL1}\big(b_{\sigma(i)},\hat{b}_{\sigma(i)}\big)+\mathcal{L}_{DIoU}\big(b,b^{gt}\big),\qquad \mathcal{L}_{DIoU}=1-IoU+\frac{\rho^{2}(b,b^{gt})}{c^{2}}$
wherein $b_{\sigma(i)}$ represents the target box of the i-th index, $\hat{b}_{\sigma(i)}$ represents the prediction box of the i-th index, $b$ and $b^{gt}$ represent the center points of the prediction box and the target box respectively, $\rho$ represents the Euclidean distance between the two center points, and $c$ represents the diagonal distance of the smallest rectangle that can cover both the prediction box and the target box.
Based on the same inventive concept, the present invention also provides an apparatus comprising a memory and a processor, wherein:
a memory for storing a computer program capable of running on the processor;
a processor for performing the unmanned aerial vehicle-oriented small target detection method steps based on the improved DETR as described above when running said computer program.
Based on the same inventive concept, the present invention also provides a storage medium having stored thereon a computer program which, when executed by at least one processor, implements the unmanned aerial vehicle-oriented small target detection method steps based on the improved DETR as described above.
The beneficial effects are that: compared with the prior art, aiming at the difficulty existing target detection algorithms have in detecting small targets in unmanned aerial vehicle scenes and on detection models deployed on unmanned aerial vehicles, the network structure is redesigned, multi-scale and multi-level information is fused, the characterization capability of the network is improved, and small-target detection precision is improved; a lightweight Backbone is used so that the model can be better deployed in practical scenarios such as unmanned aerial vehicles; and using the novel FlashAttention-2 attention greatly reduces computational complexity while improving computational efficiency and memory utilization.
Drawings
FIG. 1 is a schematic diagram of the improved DETR small target detection network according to the present invention;
FIG. 2 is a schematic diagram of the ShuffleNet-d module according to the present invention;
FIG. 3 is a schematic diagram of the FlashAttention-2 module structure according to the present invention;
FIG. 4 is a schematic diagram of the Deformable-CCFM module according to the present invention;
FIG. 5 is a schematic diagram of the Fusion module according to the present invention;
FIG. 6 is a schematic diagram of the modified RepVGG-block module structure according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
The invention provides an unmanned aerial vehicle-oriented small target detection method based on improved DETR, which specifically comprises the following steps:
step S1: and constructing a data set aiming at the detection of the small target of the unmanned aerial vehicle, and dividing the data set into a training set and a verification set.
The VisDrone2019 data set is selected. It was collected with different unmanned aerial vehicle platforms in different scenes and under different weather and illumination conditions, and includes categories such as pedestrian, car and bicycle against varied backgrounds; this diversity of categories and backgrounds helps improve the generalization capability of the detection model. The data set contains 6471 training samples and 1610 test samples, and the annotations of the VisDrone data set are converted to COCO format.
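As a concrete illustration, the conversion step can be sketched as follows in Python, assuming the standard VisDrone2019 annotation layout (one comma-separated .txt per image with fields bbox_left, bbox_top, bbox_width, bbox_height, score, category, truncation, occlusion); the file paths and the category list are illustrative assumptions rather than values fixed by the invention:

```python
# Minimal VisDrone -> COCO conversion sketch under the stated assumptions.
import json
from pathlib import Path
from PIL import Image

def visdrone_to_coco(img_dir: str, ann_dir: str, out_json: str) -> None:
    images, annotations = [], []
    ann_id = 0
    for img_id, img_path in enumerate(sorted(Path(img_dir).glob("*.jpg"))):
        w, h = Image.open(img_path).size
        images.append({"id": img_id, "file_name": img_path.name,
                       "width": w, "height": h})
        txt = Path(ann_dir) / (img_path.stem + ".txt")
        for line in txt.read_text().strip().splitlines():
            x, y, bw, bh, score, cat = [int(v) for v in line.split(",")[:6]]
            if cat == 0:          # skip the "ignored regions" pseudo-class
                continue
            annotations.append({"id": ann_id, "image_id": img_id,
                                "category_id": cat,
                                "bbox": [x, y, bw, bh],   # COCO xywh format
                                "area": bw * bh, "iscrowd": 0})
            ann_id += 1
    categories = [{"id": i, "name": n} for i, n in enumerate(
        ["pedestrian", "people", "bicycle", "car", "van", "truck", "tricycle",
         "awning-tricycle", "bus", "motor", "others"], start=1)]
    Path(out_json).write_text(json.dumps(
        {"images": images, "annotations": annotations,
         "categories": categories}))
```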
Step S2: constructing a small target detection network based on the improved DETR, as shown in FIG. 1: ShuffleNet-d is adopted as the feature extraction network of the DETR, and a 1×1 convolution module is introduced to extract features along the channel dimension, wherein ShuffleNet-d is ShuffleNetV2 with the original global pooling and fully connected layers deleted; the self-attention in the DETR encoder is replaced with FlashAttention-2; and the Neck layer of the DETR adopts the deformable cross-scale feature fusion module Deformable-CCFM.
The invention changes the backbone of the original DETR to a lightweight feature extraction network, ShuffleNet-d, shown in FIG. 2. First, the initial image passes through a 3×3 convolution layer with stride 2; this layer uses a number of filters (also called convolution kernels or convolution weights) to extract image features, and the image becomes 1/2 of its original size. The resulting feature map is then max-pooled, typically with a 2×2 pooling kernel that takes the maximum value in each 2×2 region, thereby halving the spatial dimensions of the feature map and bringing the image to 1/4 of its original size. A stage module consists of ShuffleNetV2 unit 1 and ShuffleNetV2 unit 2, and the repetition counts of unit 1 and unit 2 differ between stage modules; the first block of each stage consists of ShuffleNetV2 unit 1 with stride 2, which completes the downsampling and doubles the output channels. In the stage2 module, unit 1 and unit 2 are repeated 1 and 3 times respectively, and the output image becomes 1/8 of the original size; in the stage3 module they are repeated 1 and 7 times, and the output becomes 1/16 of the original size; in the stage4 module they are repeated 1 and 3 times, and the output becomes 1/32 of the original size. The outputs of stage2, stage3 and stage4 are then used as multi-scale features, with their channel counts unified by 1×1 conv, as the input of the multi-scale feature fusion module: the stage2 output passes through a 1×1 conv to give S3, the stage3 output gives S4, and the stage4 output gives S5.
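A minimal PyTorch sketch of this backbone wiring follows. It assumes torchvision's ShuffleNetV2 as the source of the stem and the three stages (torchvision's stem uses a 3×3 max pool with stride 2 where the text describes 2×2; the spatial strides match), and hidden_dim = 256 is an illustrative choice:

```python
# ShuffleNet-d sketch: drop the global pooling / FC head of ShuffleNetV2,
# tap stage2/3/4, and unify channels with 1x1 convs to get S3, S4, S5.
import torch
import torch.nn as nn
from torchvision.models import shufflenet_v2_x1_0

class ShuffleNetD(nn.Module):
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        net = shufflenet_v2_x1_0()
        self.stem = nn.Sequential(net.conv1, net.maxpool)   # 1/2 then 1/4
        self.stage2, self.stage3, self.stage4 = net.stage2, net.stage3, net.stage4
        # stage output channels for the x1_0 variant: 116, 232, 464
        self.proj = nn.ModuleList(
            nn.Conv2d(c, hidden_dim, kernel_size=1) for c in (116, 232, 464))

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        c3 = self.stage2(x)     # stride 8
        c4 = self.stage3(c3)    # stride 16
        c5 = self.stage4(c4)    # stride 32
        s3, s4, s5 = (p(c) for p, c in zip(self.proj, (c3, c4, c5)))
        return s3, s4, s5

feats = ShuffleNetD()(torch.randn(1, 3, 640, 640))
print([f.shape for f in feats])   # strides 8 / 16 / 32, 256 channels each
```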
As shown in FIG. 3, S5 is embedded as the input x, and three different linear transformations, called Query, Key and Value, are represented by Q, K and V respectively and fed into FlashAttention-2. Q is split across several warps while K and V are kept accessible to all warps; each warp performs a matrix multiplication to obtain a slice of QK^T and then multiplies only with a shared slice of V to obtain the corresponding output slice, so no communication is required between warps. Speed is further increased by reducing reads and writes to shared memory.
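A sketch of such an encoder self-attention layer is given below. It uses PyTorch's torch.nn.functional.scaled_dot_product_attention (available since PyTorch 2.0), which dispatches to a FlashAttention-style fused kernel on supported GPUs, as a stand-in for FlashAttention-2; where the dedicated flash-attn package is installed, its flash_attn_func could be substituted. Shapes and the head count are illustrative assumptions:

```python
# Encoder self-attention with a fused attention kernel (FlashAttention stand-in).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlashSelfAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # the Query/Key/Value projections
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, n, d) -> (b, heads, n, d_head), the layout SDPA expects
        q, k, v = (t.reshape(b, n, self.num_heads, -1).transpose(1, 2)
                   for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v)   # fused attention kernel
        return self.out(y.transpose(1, 2).reshape(b, n, d))

# S5 flattened to a token sequence is the encoder input in this design
tokens = torch.randn(2, 400, 256)             # e.g. a 20x20 S5 map
print(FlashSelfAttention()(tokens).shape)     # torch.Size([2, 400, 256])
```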
After features are extracted from images of different sizes, three effective feature maps are obtained and input into the Neck layer. To enhance the expressive power of the network features, the embodiment of the invention provides Deformable-CCFM in the Neck layer: the output of the Encoder is reshaped back to two dimensions and denoted F5 in order to complete the subsequent cross-scale feature fusion. As shown in FIG. 4, with S3, S4 and F5 as the inputs of Deformable-CCFM, feature fusion is completed by the Fusion module. As shown in FIG. 5, in the Fusion module F5 is first taken as F_high and S4 as F_low. F_high is first upsampled to a feature map of the same size as F_low, then concatenated with F_low along the channel dimension, after which a 1×1 convolution reduces the number of channels to the previous dimension. The output is then divided into two parts. One part performs feature interaction through n repeated RepVGG-blocks; as shown in FIG. 6, a RepVGG-block consists of three branches, namely a main branch with a 3×3 convolution kernel, a branch with a 1×1 convolution kernel, and a branch with only BN, which are added element by element and followed by a PReLU activation function. The other part is a residual edge connected directly to the output, and finally the two parts are added element by element. The fused output is then taken as F_high and S3 as F_low, and the above Fusion step is performed in the same way. The invention introduces multi-scale feature fusion to enhance small-target detection capability and improve small-target detection precision. The finally fused feature map is flattened to two dimensions, and query selection picks the Top-K features from the encoder to initialize the target queries, with K = 100 by default.
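The query-selection step at the end can be sketched as follows, assuming an auxiliary classification head scores every encoder token and the K highest-scoring tokens initialize the object queries; the helper name and the class count are illustrative:

```python
# Top-K query selection over flattened encoder features.
import torch
import torch.nn as nn

def select_queries(memory: torch.Tensor, class_head: nn.Linear, k: int = 100):
    """memory: (batch, tokens, dim) flattened fused feature map."""
    scores = class_head(memory).max(dim=-1).values   # best class score per token
    topk = scores.topk(k, dim=1).indices             # (batch, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, memory.size(-1))
    return memory.gather(1, idx)                     # (batch, k, dim)

memory = torch.randn(2, 8400, 256)
queries = select_queries(memory, nn.Linear(256, 11))   # 11 VisDrone classes
print(queries.shape)                                   # torch.Size([2, 100, 256])
```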
The Fusion module first upsamples the high-level feature map, concatenates it with the low-level feature map along the channel dimension, and applies a 1×1 convolution to reduce the number of channels to the previous dimension; the output is then divided into two parts, one of which performs feature interaction through n repeated improved RepVGG-blocks. When the ReLU activation function in the RepVGG-block processes negative inputs, the output is constantly 0, so the gradient vanishes there; the PReLU activation function instead adjusts the slope of the negative part through a learnable parameter, avoiding this vanishing-gradient problem.
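A combined sketch of the modified RepVGG-block and the Fusion step it serves is given below, under two stated reading assumptions: the channel "addition" before the 1×1 convolution is taken as concatenation (since that convolution restores the previous channel count), and the "two parts" are taken as a residual identity edge plus a stack of n blocks operating on the same tensor:

```python
# Modified RepVGG-block (PReLU in place of ReLU) and one Fusion step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepVGGBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, bias=False), nn.BatchNorm2d(dim))
        self.branch1x1 = nn.Sequential(
            nn.Conv2d(dim, dim, 1, bias=False), nn.BatchNorm2d(dim))
        self.branch_id = nn.BatchNorm2d(dim)   # BN-only identity branch
        self.act = nn.PReLU(dim)               # learnable negative slope

    def forward(self, x):
        return self.act(self.branch3x3(x) + self.branch1x1(x) + self.branch_id(x))

class Fusion(nn.Module):
    def __init__(self, dim: int = 256, n: int = 3):
        super().__init__()
        self.reduce = nn.Conv2d(2 * dim, dim, 1)   # back to the previous dim
        self.blocks = nn.Sequential(*[RepVGGBlock(dim) for _ in range(n)])

    def forward(self, f_high, f_low):
        f_high = F.interpolate(f_high, size=f_low.shape[-2:], mode="nearest")
        y = self.reduce(torch.cat([f_high, f_low], dim=1))
        return y + self.blocks(y)   # feature-interaction path + residual edge

f5, s4 = torch.randn(1, 256, 20, 20), torch.randn(1, 256, 40, 40)
print(Fusion()(f5, s4).shape)       # torch.Size([1, 256, 40, 40])
```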
Step S3: smooth-L1 and DIoULSs were used as loss functions for the improved DETR lightweight feature extraction network based on Flashatttion-2.
The 100 target queries are taken as the input of the decoder, and 100 tokens are output after attention and mapping; the tokens are then fed into two FFNs simultaneously, giving the positions and class scores of 100 boxes. Finally, bipartite graph matching is performed between the prediction boxes and the ground-truth boxes, and the loss function is computed. The invention optimizes the regression loss of the original DETR: to improve detection accuracy, the Smooth-L1 and DIoU loss functions are combined as the regression loss for predicting and regressing the detection boxes.
$\mathcal{L}_{SmoothL1}=\sum_{i}\mathrm{smooth}_{L1}\big(b_{\sigma(i)}-\hat{b}_{\sigma(i)}\big),\qquad \mathrm{smooth}_{L1}(x)=\begin{cases}0.5x^{2}, & |x|<1\\ |x|-0.5, & \text{otherwise}\end{cases}$
wherein $b_{\sigma(i)}$ represents the target box of the i-th index and $\hat{b}_{\sigma(i)}$ represents the prediction box of the i-th index.
The Smooth-L1 loss function uses only the coordinate values and the width and height of the prediction box and the target box when computing the loss, and cannot describe whether a containment relationship exists between the prediction box and the target box. To address this, the DIoU loss function is introduced when computing the regression loss to measure the overlap loss between the prediction box and the target box.
$\mathcal{L}_{DIoU}=1-IoU+\frac{\rho^{2}(b,b^{gt})}{c^{2}}$
wherein $b$ and $b^{gt}$ represent the center points of the prediction box and the target box respectively, $\rho$ represents the Euclidean distance between the two center points, and $c$ represents the diagonal distance of the smallest rectangle that can cover both the prediction box and the target box.
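A sketch of the combined regression loss implied by the two formulas is given below, assuming boxes in (x1, y1, x2, y2) form and an equal, unweighted sum of the two terms (the weighting is an illustrative assumption):

```python
# Combined Smooth-L1 + DIoU regression loss, implemented from the formulas above.
import torch
import torch.nn.functional as F

def diou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2)."""
    # intersection-over-union
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).prod(dim=1)
    iou = inter / (area_p + area_t - inter + 1e-7)
    # rho^2: squared distance between the two box centers
    rho2 = ((pred[:, :2] + pred[:, 2:]) / 2
            - (target[:, :2] + target[:, 2:]) / 2).pow(2).sum(dim=1)
    # c^2: squared diagonal of the smallest enclosing rectangle
    c_lt = torch.min(pred[:, :2], target[:, :2])
    c_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = (c_rb - c_lt).pow(2).sum(dim=1) + 1e-7
    return (1 - iou + rho2 / c2).mean()

def box_regression_loss(pred, target):
    return F.smooth_l1_loss(pred, target) + diou_loss(pred, target)

pred = torch.tensor([[10., 10., 50., 60.]])
target = torch.tensor([[12., 8., 48., 62.]])
print(box_regression_loss(pred, target))
```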
Step S4: training the improved DETR-based lightweight feature extraction network with the training set; the test set is then input into the trained network based on the improved DETR, and the network is evaluated.
Based on the same inventive concept, the present invention also provides an apparatus comprising a memory and a processor, wherein: the memory is used for storing a computer program capable of running on the processor; and the processor is used for performing the unmanned aerial vehicle-oriented small target detection method steps based on the improved DETR as described above when running said computer program.
Based on the same inventive concept, the present invention also provides a storage medium having stored thereon a computer program which, when executed by at least one processor, implements the unmanned aerial vehicle-oriented small target detection method steps based on the improved DETR as described above.
Thus far, the technical solution of the present invention has been described in connection with the specific experimental procedure shown in the drawings, but the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.

Claims (7)

1. An unmanned aerial vehicle-oriented small target detection method based on improved DETR, characterized by comprising the following steps:
(1) Constructing a data set aiming at unmanned aerial vehicle small target detection, and dividing the data set into a training set and a testing set;
(2) Constructing a small target detection network based on the improved DETR: ShuffleNet-d is adopted as the feature extraction network of the DETR, and a 1×1 convolution module is introduced to extract features along the channel dimension, wherein ShuffleNet-d is ShuffleNetV2 with the original global pooling and fully connected layers deleted; the self-attention in the DETR encoder is replaced with FlashAttention-2; and the Neck layer of the DETR adopts the deformable cross-scale feature fusion module Deformable-CCFM;
(3) Adopting Smooth-L1 Loss and DIoU Loss as the loss function of the improved, FlashAttention-2-based lightweight DETR feature extraction network;
(4) Training a small target detection network based on improved DETR by using a training set;
(5) Inputting the test set into the trained small target detection network based on the improved DETR, and evaluating the network to realize unmanned aerial vehicle-oriented small target detection.
2. The unmanned aerial vehicle-oriented small target detection method based on improved DETR of claim 1, wherein the ShuffleNet-d implementation procedure of step (2) is as follows:
firstly, the initial image passes through a 3×3 convolution layer with stride 2, which uses filters to extract image features and brings the image to 1/2 of its original size; a max-pooling operation is performed on the resulting feature map with 2×2 pooling kernels, taking the maximum value in each 2×2 region, so that the spatial dimensions of the feature map are halved and the image becomes 1/4 of its original size; a stage module consists of ShuffleNetV2 unit 1 and ShuffleNetV2 unit 2, and the repetition counts of unit 1 and unit 2 differ between stage modules; the first block of each stage consists of ShuffleNetV2 unit 1 with stride 2, completing the downsampling and doubling the output channels; in the stage2 module, unit 1 and unit 2 are repeated 1 and 3 times respectively, and the output image becomes 1/8 of the original size; in the stage3 module, unit 1 and unit 2 are repeated 1 and 7 times, and the output image becomes 1/16 of the original size; in the stage4 module, unit 1 and unit 2 are repeated 1 and 3 times, and the output image becomes 1/32 of the original size; the outputs of stage2, stage3 and stage4 are then used as multi-scale features, with their channel counts unified by 1×1 conv, as the input of the multi-scale feature fusion module: the stage2 output passes through a 1×1 conv to give S3, the stage3 output gives S4, and the stage4 output gives S5.
3. The unmanned aerial vehicle-oriented small target detection method based on improved DETR of claim 1, wherein the deformable cross-scale feature Fusion module Deformable-CCFM of step (2) completes feature fusion through a Fusion module: F5 is taken as F_high and S4 as F_low; F_high is first upsampled to a feature map of the same size as F_low and concatenated with F_low along the channel dimension; a 1×1 convolution then reduces the number of channels to the previous dimension; the output is divided into two parts, one performing feature interaction through n repeated RepVGG-blocks, the other being a residual edge connected directly to the output; finally the two parts are added element by element. The output fused by the first Fusion module is then taken as F_high and S3 as F_low, and the second Fusion module is completed in the same way, outputting the final fused feature map.
4. The unmanned aerial vehicle-oriented small target detection method based on improved DETR according to claim 3, wherein the RepVGG-block has three parallel branches: a main branch with a 3×3 convolution kernel, a branch with a 1×1 convolution kernel, and a branch with only BN; the three branches are added element by element, and finally a PReLU activation function is applied.
5. The unmanned aerial vehicle-oriented small target detection method based on the improved DETR of claim 1, wherein the loss function used for training in step (4) is given by the following formula:
$\mathcal{L}_{reg}=\mathcal{L}_{SmoothL1}\big(b_{\sigma(i)},\hat{b}_{\sigma(i)}\big)+\mathcal{L}_{DIoU}\big(b,b^{gt}\big),\qquad \mathcal{L}_{DIoU}=1-IoU+\frac{\rho^{2}(b,b^{gt})}{c^{2}}$
wherein $b_{\sigma(i)}$ represents the target box of the i-th index, $\hat{b}_{\sigma(i)}$ represents the prediction box of the i-th index, $b$ and $b^{gt}$ represent the center points of the prediction box and the target box respectively, $\rho$ represents the Euclidean distance between the two center points, and $c$ represents the diagonal distance of the smallest rectangle that can cover both the prediction box and the target box.
6. An apparatus comprising a memory and a processor, wherein:
a memory for storing a computer program capable of running on the processor;
a processor for performing the unmanned aerial vehicle-oriented small target detection method steps based on the improved DETR as claimed in any of claims 1-5 when running said computer program.
7. A storage medium having stored thereon a computer program which, when executed by at least one processor, implements the unmanned aerial vehicle-oriented small target detection method steps based on the improved DETR as claimed in any of claims 1-5.
CN202310931094.3A 2023-07-27 2023-07-27 Unmanned aerial vehicle-oriented small target detection method and device based on improved DETR Pending CN116958687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310931094.3A CN116958687A (en) 2023-07-27 2023-07-27 Unmanned aerial vehicle-oriented small target detection method and device based on improved DETR

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310931094.3A CN116958687A (en) 2023-07-27 2023-07-27 Unmanned aerial vehicle-oriented small target detection method and device based on improved DETR

Publications (1)

Publication Number Publication Date
CN116958687A true CN116958687A (en) 2023-10-27

Family

ID=88444236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310931094.3A Pending CN116958687A (en) 2023-07-27 2023-07-27 Unmanned aerial vehicle-oriented small target detection method and device based on improved DETR

Country Status (1)

Country Link
CN (1) CN116958687A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117191821A (en) * 2023-11-03 2023-12-08 山东宇影光学仪器有限公司 High-light-transmittance Fresnel lens real-time detection method based on defocable-DAB-DETR
CN117191821B (en) * 2023-11-03 2024-02-06 山东宇影光学仪器有限公司 High-light-transmittance Fresnel lens real-time detection method based on defocable-DAB-DETR


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination