CN115937736A - Small target detection method based on attention and context awareness - Google Patents

Small target detection method based on attention and context awareness

Info

Publication number
CN115937736A
Authority
CN
China
Prior art keywords
attention
module
small
model
network
Prior art date
Legal status
Pending
Application number
CN202211354054.9A
Other languages
Chinese (zh)
Inventor
刘梦菲
陆小锋
李克松
毛建华
Current Assignee
Wenzhou Research Institute Of Shanghai University
University of Shanghai for Science and Technology
Original Assignee
Wenzhou Research Institute Of Shanghai University
University of Shanghai for Science and Technology
Priority date
Filing date
Publication date
Application filed by Wenzhou Research Institute Of Shanghai University, University of Shanghai for Science and Technology filed Critical Wenzhou Research Institute Of Shanghai University
Priority to CN202211354054.9A priority Critical patent/CN115937736A/en
Publication of CN115937736A publication Critical patent/CN115937736A/en


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a small target detection method based on attention and context awareness, comprising the following steps: S1, sample collection: collect sample video; S2, sample processing: perform primary screening and division on the acquired video data; S3, sample labeling: label the data set with software; and S4, model analysis: provide a training model for training. By detecting the target contour and adjusting the aspect ratio of the label, the method connects the features of the small target with the features of the related background, increases the information available for the small target, and thus uses context to assist small target detection; by introducing an attention mechanism into the network, it raises the model's attention to small targets and its position sensitivity, enhancing the model's detection performance.

Description

Small target detection method based on attention and context awareness
Technical Field
The invention relates to the technical field of electronic information, and in particular to a small target detection method based on attention and context awareness.
Background
In real target detection scenarios, such as aerial imagery, autonomous driving and defect detection, small targets are abundant, and detecting them is of great significance for industrial automation. However, in existing general-purpose target detection algorithms, whether one-stage or two-stage, detection performance on small targets lags far behind that on medium- and large-scale targets.
Small targets are difficult to detect for several reasons: they occupy few pixels in the image, so the information they contain is very limited; their bounding boxes demand higher localization accuracy than those of large/medium-scale targets; they account for a small proportion of most data sets and are difficult to label, so model optimization pays them little attention; and they tend to cluster, becoming indistinguishable to the model after repeated downsampling.
To address these difficulties, scholars at home and abroad have started from multiple angles, improving on mainstream target detectors and studying techniques for small target detection, for example image augmentation by stitching, multi-scale image pyramids, and generative adversarial networks to raise small-target resolution. However, to enhance a network's ability to detect small targets, most methods introduce a large amount of computation and redundant information.
Disclosure of Invention
The invention aims to remedy the defects in the prior art and provides a small target detection method based on attention and context awareness.
In order to achieve the purpose, the invention adopts the following technical scheme:
a small target detection method based on attention and context perception comprises the following steps:
s1, sample collection, namely providing a video collection module and carrying out sample video collection;
s2, sample processing, namely performing primary screening and division on the collected video data, extracting video frames at equal intervals to obtain a data set, and dividing the data set into a training set and a verification set, wherein the number of the training set and the verification set is 8;
s3, labeling the sample, labeling the data set by using LabelImg software according to the requirement of reading data by the model, cutting the data set into 640 x 640 size by windowing, and naming the obtained data set as IFPS;
and S4, analyzing the model, providing an AECA-YOLO target detection model, and inputting the data set IFPS into the model for training.
Preferably, the AECA-YOLO target detection model network structure includes a backbone network, a neck network, and a head network.
Preferably, a coordinate attention module is added before the SPPF module of the backbone network, and a residual module is arranged in the coordinate attention module. Channel attention is decomposed into the horizontal (X) and vertical (Y) directions; features in the two directions are one-dimensionally encoded by average pooling; the channels are then compressed in the spatial dimension by concatenation and convolution operations; spatial information in the horizontal and vertical directions is encoded by batch normalization and nonlinear regression, yielding an attention map that carries features of both the spatial and channel dimensions; finally, the spatial information passed through a Sigmoid activation function is fused by channel-wise weighting.
Preferably, an up-sampling module is arranged in the neck network, and the up-sampling module comprises an attention module and a sub-pixel sampling module.
Preferably, the upsampling module fuses spatial information with respect to the deep feature map x having a shape of H × W × C through the coordinate attention module to obtain a feature map x' having the same shape.
Preferably, for a given upsampling rate σ, the sub-pixel upsampling module relates any point l′ = (i′, j′) in the output feature map x″ to the corresponding point l = (i, j) in the input feature map x by

l = (i, j) = (⌊i′/σ⌋, ⌊j′/σ⌋).
Preferably, the sub-pixel upsampling module compresses the number of channels of the feature map to C_m by a 1×1 convolution; a k_en × k_en convolution then generates a position-aware upsampling kernel from the input feature map x′, which is expanded along the channel dimension to obtain an upsampling kernel ω_l′ of size σH × σW × k_up². Softmax normalization is performed on the upsampling kernel so that the channel weights of any l′ sum to 1.
Preferably, the sub-pixel upsampling module computes the dot product of the k_up × k_up neighborhood of the original feature map x centered at point l with the predicted upsampling kernel ω_l′ to obtain the final output high-resolution feature map x″:

x″_l′ = Σ_{n=−r}^{r} Σ_{m=−r}^{r} ω_l′(n, m) · x(i + n, j + m), where r = ⌊k_up/2⌋.

Different channels at the same location l share the same upsampling kernel.
Preferably, the head network adopts a decoupled detection head in place of a coupled head, and a CBL module is arranged in the decoupled detection head: the decoupled head reduces the number of channels of the neck network output feature map to 256 through a 1×1 CBL module, and then uses two parallel branches, each passing through 2 CBL convolutional layers, to form a classification detection head and a regression detection head.
Preferably, the upsampling module employs an upsampling structure that fuses an attention mechanism and context information.
The invention has the beneficial effects that:
1. By detecting the target contour and adjusting the aspect ratio of the label, the features of the small target are connected with the features of the related background, increasing the information available for the small target and achieving the effect of using context to assist small target detection.
2. By introducing an attention mechanism into the network, the model's attention to small targets and its position sensitivity are improved, enhancing the model's detection performance.
Drawings
FIG. 1 is a diagram of an AECA-YOLO target detection model;
FIG. 2 is a diagram of a coordinate attention module configuration;
FIG. 3 is a block diagram of a FAC-UpSample upsampling module;
FIG. 4 is a block diagram of a decoupling detector head;
FIG. 5 is a schematic diagram of a CIoU;
FIG. 6 is prediction-box distribution diagram a of the ablation experiment models;
FIG. 7 is prediction-box distribution diagram b of the ablation experiment models;
FIG. 8 is prediction-box distribution diagram c of the ablation experiment models;
FIG. 9 is prediction-box distribution diagram d of the ablation experiment models;
FIG. 10 is prediction-box distribution diagram e of the ablation experiment models;
FIG. 11 is prediction-box distribution diagram f of the ablation experiment models.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
A small target detection method based on attention and context awareness comprises the following steps:
s1, collecting samples, providing a video collecting module, carrying out sample video collection, selecting a 1080p video shooting mode by adopting a double light version and longitude and latitude M300 RTK of the Mavic 2 industry of the unmanned aerial vehicle of Xinjiang, and carrying out unmanned aerial vehicle video collection aiming at illegal sea fishing piles;
s2, sample processing, namely performing primary screening and division on the acquired video data, and extracting video frames at equal intervals to obtain a data set, wherein 380 images are counted, the resolution is 1920 multiplied by 1080, and the number of training sets and verification sets is 8;
s3, labeling samples, namely labeling the data set by using LabelImg software according to the requirement of reading data by the model, windowing and cutting the data set into a size of 640 multiplied by 640, ensuring that each target is reserved during cutting, and naming the obtained data set as IFPS;
and S4, analyzing the model, providing an AECA-YOLO target detection model, and inputting the data set IFPS into the model for training.
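By way of illustration, a minimal sketch of the frame-extraction and windowed-cropping pipeline of steps S2 and S3 might look as follows, assuming OpenCV is used; the frame interval, window stride and file naming are illustrative assumptions rather than values fixed by the patent:

import os
import cv2

def extract_frames(video_path: str, out_dir: str, interval: int = 30) -> None:
    """Step S2: extract video frames at equal intervals."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:  # equal-interval sampling
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()

def window_crops(image, size: int = 640, stride: int = 480):
    """Step S3: slide a size x size window over a frame (e.g. 1920 x 1080),
    overlapping adjacent windows so targets near window borders are preserved."""
    h, w = image.shape[:2]
    ys = list(range(0, max(h - size, 0) + 1, stride))
    if h > size and ys[-1] != h - size:
        ys.append(h - size)  # cover the bottom edge
    xs = list(range(0, max(w - size, 0) + 1, stride))
    if w > size and xs[-1] != w - size:
        xs.append(w - size)  # cover the right edge
    for y in ys:
        for x in xs:
            yield image[y:y + size, x:x + size]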
The AECA-YOLO target detection model network structure comprises a backbone network, a neck network and a head network.
A coordinate attention module is added before the SPPF module of the backbone network so that the feature extraction process pays more attention to the position information of small targets, and a residual module is arranged inside the coordinate attention module. Channel attention is decomposed into the horizontal (X) and vertical (Y) directions, and features in the two directions are one-dimensionally encoded by average pooling. Concatenation and convolution operations then compress the channels in the spatial dimension, and batch normalization and nonlinear regression encode the spatial information of the horizontal and vertical directions, producing attention maps that carry features of both the spatial and channel dimensions. Finally, the spatial information passed through a Sigmoid activation function is fused by channel-wise weighting. In this way, the coordinate attention mechanism captures channel dependencies along one spatial direction while preserving position information along the other; the two are complementary and strengthen the expression of target features. Because small targets have low pixel resolution and carry little information, adding coordinate attention before the SPPF module makes the fusion of local and global features focus more on small targets and improves the network's recognition accuracy.
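For illustration only, a minimal PyTorch sketch of a coordinate attention module of this kind is given below; the channel-reduction ratio and the choice of non-linearity are assumptions, not values fixed by the invention:

import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)  # channel compression
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()  # non-linearity (assumed)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # one-dimensional average pooling along the horizontal (X) and vertical (Y) directions
        x_h = x.mean(dim=3, keepdim=True)                      # B x C x H x 1
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # B x C x W x 1
        # concatenate in the spatial dimension, compress channels, then encode
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        y_w = y_w.permute(0, 1, 3, 2)
        # direction-wise attention maps fused onto the input by channel-wise weighting
        a_h = torch.sigmoid(self.conv_h(y_h))  # B x C x H x 1
        a_w = torch.sigmoid(self.conv_w(y_w))  # B x C x 1 x W
        return x * a_h * a_w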
An up-sampling module is arranged in the neck network and comprises an attention module and a sub-pixel sampling module.
For a deep feature map x of shape H × W × C, the upsampling module fuses spatial information through the coordinate attention module to obtain a feature map x′ of the same shape.
For a given upsampling rate σ, the sub-pixel upsampling module relates any point l′ = (i′, j′) in the output feature map x″ to the corresponding point l = (i, j) in the input feature map x by

l = (i, j) = (⌊i′/σ⌋, ⌊j′/σ⌋).
The sub-pixel upsampling module compresses the number of channels of the feature map to C_m by a 1×1 convolution; a k_en × k_en convolution then generates a position-aware upsampling kernel from the input feature map x′, which is expanded along the channel dimension to obtain an upsampling kernel ω_l′ of size σH × σW × k_up². Softmax normalization is performed on the upsampling kernel so that the channel weights of any l′ sum to 1.
The sub-pixel upsampling module computes the dot product of the k_up × k_up neighborhood of the original feature map x centered at point l with the predicted upsampling kernel ω_l′ to obtain the final output high-resolution feature map x″:

x″_l′ = Σ_{n=−r}^{r} Σ_{m=−r}^{r} ω_l′(n, m) · x(i + n, j + m), where r = ⌊k_up/2⌋.

Different channels at the same location l share the same upsampling kernel.
Each pixel of the output feature map thus makes full use of the information in its surrounding region. Through feature reassembly and channel enhancement, FAC-UpSample (Fusing Attention and Context) gives the feature map a larger receptive field and richer semantic information than the nearest-neighbor upsampling originally used by the network. Context information is critical for small target detection: the upsampling module aggregates context information with an attention mechanism and reassembles the attention-enhanced channel features by sub-pixel convolution, yielding a high-resolution feature map with richer detail. Its structure is shown in FIG. 3.
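A minimal PyTorch sketch of such a fusing-attention-and-context upsampler follows, reusing the CoordinateAttention sketch above and a CARAFE-style content-aware reassembly; the hyper-parameters c_mid = 64, k_en = 3 and k_up = 5 are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FACUpSample(nn.Module):
    def __init__(self, channels: int, sigma: int = 2,
                 c_mid: int = 64, k_en: int = 3, k_up: int = 5):
        super().__init__()
        self.sigma, self.k_up = sigma, k_up
        self.attn = CoordinateAttention(channels)      # from the sketch above
        self.compress = nn.Conv2d(channels, c_mid, 1)  # compress channels to C_m
        self.encode = nn.Conv2d(c_mid, sigma * sigma * k_up * k_up,
                                k_en, padding=k_en // 2)  # kernel prediction
        self.shuffle = nn.PixelShuffle(sigma)          # expand to (sigma*H) x (sigma*W)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x = self.attn(x)  # x' with fused spatial information
        # predict a k_up x k_up reassembly kernel for every output position l'
        kernels = self.shuffle(self.encode(self.compress(x)))  # B x k_up^2 x sH x sW
        kernels = F.softmax(kernels, dim=1)  # channel weights of any l' sum to 1
        # gather the k_up x k_up neighborhood of each source position l of x'
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)  # B x C*k_up^2 x H*W
        patches = F.interpolate(patches.view(b, c * self.k_up ** 2, h, w),
                                scale_factor=self.sigma, mode='nearest')
        patches = patches.view(b, c, self.k_up ** 2,
                               self.sigma * h, self.sigma * w)
        # dot product of each neighborhood with its predicted kernel
        return (patches * kernels.unsqueeze(1)).sum(dim=2)  # B x C x sH x sW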
The head network adopts a decoupled detection head in place of the coupled head, and a CBL module is arranged in the decoupled detection head: the decoupled head first reduces the number of channels of the neck network output feature map to 256 through a 1×1 CBL module, and then uses two parallel branches, each passing through 2 CBL convolutional layers, to form a classification detection head and a regression detection head. Decoupling the detection tasks avoids the conflict between the two tasks and improves the network's ability to localize small target bounding boxes.
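A minimal PyTorch sketch of such a decoupled head follows, reading CBL as Conv + BatchNorm + LeakyReLU; the class count and the prediction-channel layout are illustrative assumptions:

import torch
import torch.nn as nn

def cbl(c_in: int, c_out: int, k: int = 3) -> nn.Sequential:
    """The CBL block referred to in the text: Conv + BatchNorm + LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class DecoupledHead(nn.Module):
    def __init__(self, c_in: int, num_classes: int = 1):
        super().__init__()
        self.stem = cbl(c_in, 256, k=1)  # 1x1 CBL reduces channels to 256
        self.cls_branch = nn.Sequential(cbl(256, 256), cbl(256, 256))
        self.reg_branch = nn.Sequential(cbl(256, 256), cbl(256, 256))
        self.cls_pred = nn.Conv2d(256, num_classes, 1)  # classification head
        self.reg_pred = nn.Conv2d(256, 4 + 1, 1)        # box (4) + objectness (1)

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        return self.cls_pred(self.cls_branch(x)), self.reg_pred(self.reg_branch(x))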
Besides learning the discriminative features of the target itself, the localization loss function used during training is also important for the small target detection task. The position-accuracy metric is very sensitive for small targets with few pixels: a slight positional offset causes the error between the prediction box and the calibration box to increase sharply. The CIoU loss function is therefore improved to relieve the severe influence of position offset on small targets.
CIoU (Complete Intersection over Union) is a standard for evaluating the bounding-box accuracy of an object detection algorithm, as illustrated in FIG. 5. On the basis of IoU, CIoU fully considers the overlap area of the boxes, the distance between their center points and their aspect ratio, which accelerates network convergence and improves model accuracy. The R-CIoU proposed by the invention builds on this and, targeting the position sensitivity of small targets, improves the expression of the center-point distance. Its calculation formula is:
L_R-CIoU = 1 − IoU + ρ(b, b^gt)/c + αv

where ρ(b, b^gt) is the square root of the Euclidean distance between the center points of the prediction box and the ground-truth box, and c is the diagonal length of the minimum enclosing box of the prediction box and the ground-truth box;

v = (4/π²) · (arctan(w^gt/h^gt) − arctan(w/h))²

α = v / ((1 − IoU) + v)

(w, h) and (w^gt, h^gt) are the width and height of the prediction box and the ground-truth box, respectively; v is a penalty term measuring aspect-ratio consistency, so that as the prediction box rapidly approaches the target box its aspect ratio also approaches that of the target box, improving the accuracy of the model's predicted position.
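A sketch of a loss of this kind is given below; since the exact R-CIoU expression appears only as an image in the original publication, the square-rooted center-distance term is an assumption based on the surrounding text, and the v and alpha terms follow standard CIoU:

import math
import torch

def r_ciou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """R-CIoU sketch for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # IoU of prediction and ground-truth boxes
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)
    # rho: square root of the Euclidean center distance (assumption, see text);
    # c_diag: diagonal of the minimum enclosing box of the two boxes
    ctr_p = (pred[:, :2] + pred[:, 2:]) / 2
    ctr_t = (target[:, :2] + target[:, 2:]) / 2
    rho = torch.sqrt(torch.linalg.norm(ctr_p - ctr_t, dim=1) + 1e-7)
    enc_wh = torch.max(pred[:, 2:], target[:, 2:]) - torch.min(pred[:, :2], target[:, :2])
    c_diag = torch.linalg.norm(enc_wh, dim=1)
    # aspect-ratio consistency penalty v and trade-off weight alpha (standard CIoU)
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + 1e-7))
                              - torch.atan(w_p / (h_p + 1e-7))) ** 2
    alpha = v / ((1 - iou) + v + 1e-7)
    return 1 - iou + rho / (c_diag + 1e-7) + alpha * v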
Experimental setup: experiments were completed on a high-performance computing platform running Windows 10, with an AMD 5800X CPU, an RTX 3080 GPU and 32 GB of memory.
1. The model achieves high precision and high speed
The AECA-YOLO small target detection model is compared and analyzed against the existing mainstream target detection models (Faster R-CNN, SSD, CenterNet, YOLOv3 and YOLOv5) on the data set IFPS; the AP, AR and FPS of the different models are compared, verifying the efficiency of AECA-YOLO. The results of the comparative experiments are shown in Table 1:
table 1 shows the performance of the mainstream target detection model on IFPS
Figure BDA0003920269990000091
Figure BDA0003920269990000092
The proposed algorithm is superior to the mainstream target detection algorithms in detection precision, recall rate and recognition speed.
2. The coordinate attention mechanism and context information enhance model performance
To further demonstrate the optimization effect of each improvement in the AECA-YOLO algorithm on the detection model, an ablation experiment was carried out by gradually adding the improvements on the basis of YOLOv5; the experimental results are compared in Table 2:
table 2 shows the AECA-YOLO ablation experiment
Figure BDA0003920269990000101
Analysis of Table 2 shows that after context information is added to the fishing-pile labels, the model's missed-detection rate drops significantly, with recall improving by 16.22% and mAP by 27.64%; adding the coordinate attention mechanism at the end of the backbone network strengthens the network's channel feature extraction and effectively reduces background-noise interference, slightly improving both precision and recall; after the new decoupled detection head is adopted, the network's localization regression improves markedly, recall rises by 2.4%, convergence accelerates and training becomes more stable, though precision drops slightly; after the attention-fused sub-pixel upsampling module is introduced into the pyramid structure of the neck network, precision rises to 91.11%; finally, optimizing the loss function improves position sensitivity to the fishing-pile target, bringing the model's precision and recall to their highest values.
FIGS. 6 to 11 show, in order, the distribution of detection boxes in the final iteration for the six models of the ablation experiment; it can be seen intuitively that the model reduces background-noise interference, focuses more on small targets, and greatly reduces the false-detection and missed-detection rates.
With the five improvements combined, the optimized AECA-YOLO model proposed by the method, compared with the original YOLO model, effectively improves average detection precision by 29.7% and recall by 18.9% while essentially retaining the original detection speed; its FPS of 52.37 meets the real-time requirement of a frame rate above 25, making the improved AECA-YOLO algorithm more practical.
The above description covers only preferred embodiments of the invention, but the protection scope of the invention is not limited thereto; any equivalent substitution or change made by a person skilled in the art within the technical scope disclosed by the invention, according to the technical solutions and inventive concept thereof, shall fall within the protection scope of the invention.

Claims (9)

1. A small target detection method based on attention and context awareness, characterized by comprising the following steps:
S1, sample collection: provide a video collection module and collect sample video;
S2, sample processing: perform primary screening and division on the acquired video data, extract video frames at equal intervals to obtain a data set, and divide the data set into a training set and a validation set at a ratio of 8:2;
S3, sample labeling: label the data set with LabelImg software according to the model's data-reading requirements, crop the images into 640 × 640 patches by windowing, and name the resulting data set IFPS;
and S4, model analysis: provide an AECA-YOLO target detection model and input the data set IFPS into the model for training.
2. The small target detection method based on attention and context awareness according to claim 1, wherein the network structure of the AECA-YOLO target detection model comprises a backbone network, a neck network and a head network.
3. The small target detection method based on attention and context awareness according to claim 2, wherein a coordinate attention module is added before the SPPF module of the backbone network, a residual module is arranged in the coordinate attention module, channel attention is decomposed into the horizontal (X) and vertical (Y) directions, features in the two directions are encoded by average pooling, the channels are then compressed in the spatial dimension by concatenation and convolution operations, spatial information in the horizontal and vertical directions is encoded by batch normalization and nonlinear regression to obtain an attention map having features of both the spatial and channel dimensions, and finally the spatial information passed through a Sigmoid activation function is fused by channel-wise weighting.
4. The small target detection method based on attention and context awareness according to claim 2, wherein an upsampling module is arranged in the neck network, and the upsampling module comprises an attention module and a sub-pixel sampling module.
5. The small target detection method based on attention and context awareness according to claim 4, wherein, for a deep feature map x of shape H × W × C, the upsampling module fuses spatial information through the coordinate attention module to obtain a feature map x′ of the same shape.
6. The small target detection method based on attention and context awareness according to claim 4, wherein, for a given upsampling rate σ, the sub-pixel upsampling module relates any point l′ = (i′, j′) in the output feature map x″ to the corresponding point l = (i, j) in the input feature map x by

l = (i, j) = (⌊i′/σ⌋, ⌊j′/σ⌋).
7. The small target detection method based on attention and context awareness according to claim 6, wherein the sub-pixel upsampling module compresses the number of channels of the feature map to C_m by a 1×1 convolution, a k_en × k_en convolution then generates a position-aware upsampling kernel from the input feature map x′, and the kernel is expanded along the channel dimension to obtain an upsampling kernel ω_l′ of size σH × σW × k_up²; Softmax normalization is performed on the upsampling kernel so that the channel weights of any l′ sum to 1.
8. The small target detection method based on attention and context awareness according to claim 7, wherein the sub-pixel upsampling module computes the dot product of the k_up × k_up neighborhood of the original feature map x centered at point l with the predicted upsampling kernel ω_l′ to obtain the final output high-resolution feature map x″:

x″_l′ = Σ_{n=−r}^{r} Σ_{m=−r}^{r} ω_l′(n, m) · x(i + n, j + m), where r = ⌊k_up/2⌋;

different channels at the same location l share the same upsampling kernel.
9. The small target detection method based on attention and context awareness according to claim 3, wherein the head network adopts a decoupled detection head in place of a coupled head, a CBL module is arranged in the decoupled detection head, the decoupled detection head reduces the number of channels of the neck network output feature map to 256 through a 1×1 CBL module, and then uses two parallel branches, each passing through 2 CBL convolutional layers, to form a classification detection head and a regression detection head.
CN202211354054.9A 2022-11-01 2022-11-01 Small target detection method based on attention and context awareness Pending CN115937736A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211354054.9A CN115937736A (en) 2022-11-01 2022-11-01 Small target detection method based on attention and context awareness


Publications (1)

Publication Number Publication Date
CN115937736A true CN115937736A (en) 2023-04-07

Family

ID=86699726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211354054.9A Pending CN115937736A (en) 2022-11-01 2022-11-01 Small target detection method based on attention and context awareness

Country Status (1)

Country Link
CN (1) CN115937736A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311082A (en) * 2023-05-15 2023-06-23 广东电网有限责任公司湛江供电局 Wearing detection method and system based on matching of key parts and images
CN116311082B (en) * 2023-05-15 2023-08-01 广东电网有限责任公司湛江供电局 Wearing detection method and system based on matching of key parts and images
CN117952985A (en) * 2024-03-27 2024-04-30 江西师范大学 Image data processing method based on lifting information multiplexing under defect detection scene

Similar Documents

Publication Publication Date Title
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN111126202A (en) Optical remote sensing image target detection method based on void feature pyramid network
CN112215128B (en) FCOS-fused R-CNN urban road environment recognition method and device
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN105654067A (en) Vehicle detection method and device
CN115937736A (en) Small target detection method based on attention and context awareness
CN112070727B (en) Metal surface defect detection method based on machine learning
CN114841244B (en) Target detection method based on robust sampling and mixed attention pyramid
CN112819748B (en) Training method and device for strip steel surface defect recognition model
CN115861772A (en) Multi-scale single-stage target detection method based on RetinaNet
CN116229452B (en) Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
CN115049640B (en) Road crack detection method based on deep learning
CN114049572A (en) Detection method for identifying small target
CN116824543A (en) Automatic driving target detection method based on OD-YOLO
CN117496384A (en) Unmanned aerial vehicle image object detection method
CN116597411A (en) Method and system for identifying traffic sign by unmanned vehicle in extreme weather
CN110909656B (en) Pedestrian detection method and system integrating radar and camera
CN115019201A (en) Weak and small target detection method based on feature refined depth network
CN116503677B (en) Wetland classification information extraction method, system, electronic equipment and storage medium
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN112084941A (en) Target detection and identification method based on remote sensing image
Zhang et al. Pavement crack detection based on deep learning
CN117197687A (en) Unmanned aerial vehicle aerial photography-oriented detection method for dense small targets
CN110889418A (en) Gas contour identification method
CN116051808A (en) YOLOv 5-based lightweight part identification and positioning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination