CN115937736A - Small target detection method based on attention and context awareness - Google Patents

Small target detection method based on attention and context awareness

Info

Publication number
CN115937736A
Authority
CN
China
Prior art keywords
attention
module
small
model
network
Prior art date
Legal status
Pending
Application number
CN202211354054.9A
Other languages
Chinese (zh)
Inventor
刘梦菲
陆小锋
李克松
毛建华
Current Assignee
Wenzhou Research Institute Of Shanghai University
University of Shanghai for Science and Technology
Original Assignee
Wenzhou Research Institute Of Shanghai University
University of Shanghai for Science and Technology
Priority date
Filing date
Publication date
Application filed by Wenzhou Research Institute Of Shanghai University, University of Shanghai for Science and Technology filed Critical Wenzhou Research Institute Of Shanghai University
Priority to CN202211354054.9A priority Critical patent/CN115937736A/en
Publication of CN115937736A publication Critical patent/CN115937736A/en


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a small target detection method based on attention and context awareness, comprising the following steps: S1, sample collection: collect sample video; S2, sample processing: perform primary screening and division on the acquired video data; S3, sample labeling: label the data set with software; and S4, model analysis: provide a training model for training. By detecting the target contour and adjusting the aspect ratio of the label, the method connects the features of the small target with the features of the related background, increases the information available for the small target, and thus uses context to assist small target detection; by introducing an attention mechanism into the network, it raises the model's attention to small targets and its position sensitivity, enhancing the model's detection performance.

Description

Small target detection method based on attention and context awareness
Technical Field
The invention relates to the technical field of electronic information, and in particular to a small target detection method based on attention and context awareness.
Background
In real target detection scenarios, such as aerial imagery, autonomous driving and defect detection, small targets are abundant, and detecting them is of great significance for industrial automation. However, in existing general-purpose target detection algorithms, whether one-stage or two-stage, detection performance on small targets lags far behind that on medium- and large-scale targets.
Small targets are difficult to detect for several reasons: they occupy few pixels in the image, so the information they contain is very limited; their bounding boxes demand higher localization accuracy than those of large/medium-scale targets; they account for a small proportion of most data sets and are difficult to label, so model optimization pays them little attention; and they tend to cluster, becoming indistinguishable to the model after repeated downsampling.
To address these difficulties, scholars at home and abroad have started from multiple angles, improving on mainstream target detectors and studying techniques for small target detection, for example image augmentation by stitching, multi-scale image pyramids, and generative adversarial networks to raise small-target resolution. However, to enhance a network's ability to detect small targets, most methods introduce a large amount of computation and redundant information.
Disclosure of Invention
The invention aims to remedy the defects in the prior art and provides a small target detection method based on attention and context awareness.
In order to achieve the purpose, the invention adopts the following technical scheme:
a small target detection method based on attention and context perception comprises the following steps:
s1, sample collection, namely providing a video collection module and carrying out sample video collection;
s2, sample processing, namely performing primary screening and division on the collected video data, extracting video frames at equal intervals to obtain a data set, and dividing the data set into a training set and a verification set, wherein the number of the training set and the verification set is 8;
s3, labeling the sample, labeling the data set by using LabelImg software according to the requirement of reading data by the model, cutting the data set into 640 x 640 size by windowing, and naming the obtained data set as IFPS;
and S4, analyzing the model, providing an AECA-YOLO target detection model, and inputting the data set IFPS into the model for training.
Preferably, the AECA-YOLO target detection model network structure includes a backbone network, a neck network, and a head network.
Preferably, a coordinate attention module is added before the SPPF module of the backbone network, and a residual module is arranged in the coordinate attention module. Channel attention is decomposed into the horizontal (X) and vertical (Y) directions; features in the two directions are one-dimensionally encoded by average pooling; the channels are then compressed in the spatial dimension by concatenation and convolution operations; spatial information in the horizontal and vertical directions is encoded by batch normalization and nonlinear regression, yielding an attention map that carries features of both the spatial and channel dimensions; finally, the spatial information passed through a Sigmoid activation function is fused by channel-wise weighting.
Preferably, an up-sampling module is arranged in the neck network, and the up-sampling module comprises an attention module and a sub-pixel sampling module.
Preferably, the upsampling module fuses spatial information with respect to the deep feature map x having a shape of H × W × C through the coordinate attention module to obtain a feature map x' having the same shape.
Preferably, for a given upsampling rate σ, the sub-pixel upsampling module relates any point l′ = (i′, j′) in the output feature map x″ to the corresponding point l = (i, j) in the input feature map x by

l = (i, j) = (⌊i′/σ⌋, ⌊j′/σ⌋).
Preferably, the sub-pixel upsampling module compresses the number of channels of the feature map to C_m by a 1×1 convolution; a k_en × k_en convolution then generates a position-aware upsampling kernel from the input feature map x′, which is expanded along the channel dimension to obtain an upsampling kernel ω_l′ of size σH × σW × k_up². Softmax normalization is performed on the upsampling kernel so that the channel weights of any l′ sum to 1.
Preferably, the sub-pixel upsampling module computes the dot product of the k_up × k_up neighborhood of the original feature map x centered at point l with the predicted upsampling kernel ω_l′ to obtain the final output high-resolution feature map x″:

x″_l′ = Σ_{n=−r}^{r} Σ_{m=−r}^{r} ω_l′(n, m) · x(i + n, j + m), where r = ⌊k_up/2⌋.

Different channels at the same location l share the same upsampling kernel.
Preferably, the head network adopts a decoupled detection head in place of a coupled head, and a CBL module is arranged in the decoupled detection head: the decoupled head reduces the number of channels of the neck network output feature map to 256 through a 1×1 CBL module, and then uses two parallel branches, each passing through 2 CBL convolutional layers, to form a classification detection head and a regression detection head.
Preferably, the upsampling module employs an upsampling structure that fuses an attention mechanism and context information.
The invention has the beneficial effects that:
1. By detecting the target contour and adjusting the aspect ratio of the label, the features of the small target are connected with the features of the related background, increasing the information available for the small target and achieving the effect of using context to assist small target detection.
2. By introducing an attention mechanism into the network, the model's attention to small targets and its position sensitivity are improved, enhancing the model's detection performance.
Drawings
FIG. 1 is a diagram of an AECA-YOLO target detection model;
FIG. 2 is a diagram of a coordinate attention module configuration;
FIG. 3 is a block diagram of a FAC-UpSample upsampling module;
FIG. 4 is a block diagram of a decoupling detector head;
FIG. 5 is a schematic diagram of a CIoU;
FIG. 6 is prediction-box distribution diagram a of the ablation experiment models;
FIG. 7 is prediction-box distribution diagram b of the ablation experiment models;
FIG. 8 is prediction-box distribution diagram c of the ablation experiment models;
FIG. 9 is prediction-box distribution diagram d of the ablation experiment models;
FIG. 10 is prediction-box distribution diagram e of the ablation experiment models;
FIG. 11 is prediction-box distribution diagram f of the ablation experiment models.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
A small target detection method based on attention and context awareness comprises the following steps:
s1, collecting samples, providing a video collecting module, carrying out sample video collection, selecting a 1080p video shooting mode by adopting a double light version and longitude and latitude M300 RTK of the Mavic 2 industry of the unmanned aerial vehicle of Xinjiang, and carrying out unmanned aerial vehicle video collection aiming at illegal sea fishing piles;
s2, sample processing, namely performing primary screening and division on the acquired video data, and extracting video frames at equal intervals to obtain a data set, wherein 380 images are counted, the resolution is 1920 multiplied by 1080, and the number of training sets and verification sets is 8;
s3, labeling samples, namely labeling the data set by using LabelImg software according to the requirement of reading data by the model, windowing and cutting the data set into a size of 640 multiplied by 640, ensuring that each target is reserved during cutting, and naming the obtained data set as IFPS;
and S4, analyzing the model, providing an AECA-YOLO target detection model, and inputting the data set IFPS into the model for training.
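By way of illustration, a minimal sketch of the frame-extraction and windowed-cropping pipeline of steps S2 and S3 might look as follows, assuming OpenCV is used; the frame interval, window stride and file naming are illustrative assumptions rather than values fixed by the patent:

import os
import cv2

def extract_frames(video_path: str, out_dir: str, interval: int = 30) -> None:
    """Step S2: extract video frames at equal intervals."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:  # equal-interval sampling
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()

def window_crops(image, size: int = 640, stride: int = 480):
    """Step S3: slide a size x size window over a frame (e.g. 1920 x 1080),
    overlapping adjacent windows so targets near window borders are preserved."""
    h, w = image.shape[:2]
    ys = list(range(0, max(h - size, 0) + 1, stride))
    if h > size and ys[-1] != h - size:
        ys.append(h - size)  # cover the bottom edge
    xs = list(range(0, max(w - size, 0) + 1, stride))
    if w > size and xs[-1] != w - size:
        xs.append(w - size)  # cover the right edge
    for y in ys:
        for x in xs:
            yield image[y:y + size, x:x + size]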
The AECA-YOLO target detection model network structure comprises a backbone network, a neck network and a head network.
A coordinate attention module is added before the SPPF module of the backbone network so that the feature extraction process pays more attention to the position information of small targets, and a residual module is arranged inside the coordinate attention module. Channel attention is decomposed into the horizontal (X) and vertical (Y) directions, and features in the two directions are one-dimensionally encoded by average pooling. Concatenation and convolution operations then compress the channels in the spatial dimension, and batch normalization and nonlinear regression encode the spatial information of the horizontal and vertical directions, producing attention maps that carry features of both the spatial and channel dimensions. Finally, the spatial information passed through a Sigmoid activation function is fused by channel-wise weighting. In this way, the coordinate attention mechanism captures channel dependencies along one spatial direction while preserving position information along the other; the two are complementary and strengthen the expression of target features. Because small targets have low pixel resolution and carry little information, adding coordinate attention before the SPPF module makes the fusion of local and global features focus more on small targets and improves the network's recognition accuracy.
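For illustration only, a minimal PyTorch sketch of a coordinate attention module of this kind is given below; the channel-reduction ratio and the choice of non-linearity are assumptions, not values fixed by the invention:

import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)  # channel compression
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()  # non-linearity (assumed)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # one-dimensional average pooling along the horizontal (X) and vertical (Y) directions
        x_h = x.mean(dim=3, keepdim=True)                      # B x C x H x 1
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # B x C x W x 1
        # concatenate in the spatial dimension, compress channels, then encode
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        y_w = y_w.permute(0, 1, 3, 2)
        # direction-wise attention maps fused onto the input by channel-wise weighting
        a_h = torch.sigmoid(self.conv_h(y_h))  # B x C x H x 1
        a_w = torch.sigmoid(self.conv_w(y_w))  # B x C x 1 x W
        return x * a_h * a_w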
An up-sampling module is arranged in the neck network and comprises an attention module and a sub-pixel sampling module.
For a deep feature map x of shape H × W × C, the upsampling module fuses spatial information through the coordinate attention module to obtain a feature map x′ of the same shape.
For a given upsampling rate σ, the sub-pixel upsampling module relates any point l′ = (i′, j′) in the output feature map x″ to the corresponding point l = (i, j) in the input feature map x by

l = (i, j) = (⌊i′/σ⌋, ⌊j′/σ⌋).
The sub-pixel upsampling module compresses the number of channels of the feature map to C_m by a 1×1 convolution; a k_en × k_en convolution then generates a position-aware upsampling kernel from the input feature map x′, which is expanded along the channel dimension to obtain an upsampling kernel ω_l′ of size σH × σW × k_up². Softmax normalization is performed on the upsampling kernel so that the channel weights of any l′ sum to 1.
The sub-pixel upsampling module computes the dot product of the k_up × k_up neighborhood of the original feature map x centered at point l with the predicted upsampling kernel ω_l′ to obtain the final output high-resolution feature map x″:

x″_l′ = Σ_{n=−r}^{r} Σ_{m=−r}^{r} ω_l′(n, m) · x(i + n, j + m), where r = ⌊k_up/2⌋.

Different channels at the same location l share the same upsampling kernel.
Each pixel of the output feature map thus makes full use of the information in its surrounding region. Through feature reassembly and channel enhancement, FAC-UpSample (Fusing Attention and Context) gives the feature map a larger receptive field and richer semantic information than the nearest-neighbor upsampling originally used by the network. Context information is critical for small target detection: the upsampling module aggregates context information with an attention mechanism and reassembles the attention-enhanced channel features by sub-pixel convolution, yielding a high-resolution feature map with richer detail. Its structure is shown in FIG. 3.
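A minimal PyTorch sketch of such a fusing-attention-and-context upsampler follows, reusing the CoordinateAttention sketch above and a CARAFE-style content-aware reassembly; the hyper-parameters c_mid = 64, k_en = 3 and k_up = 5 are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FACUpSample(nn.Module):
    def __init__(self, channels: int, sigma: int = 2,
                 c_mid: int = 64, k_en: int = 3, k_up: int = 5):
        super().__init__()
        self.sigma, self.k_up = sigma, k_up
        self.attn = CoordinateAttention(channels)      # from the sketch above
        self.compress = nn.Conv2d(channels, c_mid, 1)  # compress channels to C_m
        self.encode = nn.Conv2d(c_mid, sigma * sigma * k_up * k_up,
                                k_en, padding=k_en // 2)  # kernel prediction
        self.shuffle = nn.PixelShuffle(sigma)          # expand to (sigma*H) x (sigma*W)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x = self.attn(x)  # x' with fused spatial information
        # predict a k_up x k_up reassembly kernel for every output position l'
        kernels = self.shuffle(self.encode(self.compress(x)))  # B x k_up^2 x sH x sW
        kernels = F.softmax(kernels, dim=1)  # channel weights of any l' sum to 1
        # gather the k_up x k_up neighborhood of each source position l of x'
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)  # B x C*k_up^2 x H*W
        patches = F.interpolate(patches.view(b, c * self.k_up ** 2, h, w),
                                scale_factor=self.sigma, mode='nearest')
        patches = patches.view(b, c, self.k_up ** 2,
                               self.sigma * h, self.sigma * w)
        # dot product of each neighborhood with its predicted kernel
        return (patches * kernels.unsqueeze(1)).sum(dim=2)  # B x C x sH x sW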
The head network adopts a decoupled detection head in place of the coupled head, and a CBL module is arranged in the decoupled detection head: the decoupled head first reduces the number of channels of the neck network output feature map to 256 through a 1×1 CBL module, and then uses two parallel branches, each passing through 2 CBL convolutional layers, to form a classification detection head and a regression detection head. Decoupling the detection tasks avoids the conflict between the two tasks and improves the network's ability to localize small target bounding boxes.
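A minimal PyTorch sketch of such a decoupled head follows, reading CBL as Conv + BatchNorm + LeakyReLU; the class count and the prediction-channel layout are illustrative assumptions:

import torch
import torch.nn as nn

def cbl(c_in: int, c_out: int, k: int = 3) -> nn.Sequential:
    """The CBL block referred to in the text: Conv + BatchNorm + LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class DecoupledHead(nn.Module):
    def __init__(self, c_in: int, num_classes: int = 1):
        super().__init__()
        self.stem = cbl(c_in, 256, k=1)  # 1x1 CBL reduces channels to 256
        self.cls_branch = nn.Sequential(cbl(256, 256), cbl(256, 256))
        self.reg_branch = nn.Sequential(cbl(256, 256), cbl(256, 256))
        self.cls_pred = nn.Conv2d(256, num_classes, 1)  # classification head
        self.reg_pred = nn.Conv2d(256, 4 + 1, 1)        # box (4) + objectness (1)

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        return self.cls_pred(self.cls_branch(x)), self.reg_pred(self.reg_branch(x))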
Besides learning the discriminative features of the target itself, the localization loss function used during training is also important for the small target detection task. The position-accuracy metric is very sensitive for small targets with few pixels: a slight positional offset causes the error between the prediction box and the calibration box to increase sharply. The CIoU loss function is therefore improved to relieve the severe influence of position offset on small targets.
CIoU (Complete Intersection over Union) is a standard for evaluating the bounding-box accuracy of an object detection algorithm, as illustrated in FIG. 5. On the basis of IoU, CIoU fully considers the overlap area of the boxes, the distance between their center points and their aspect ratio, which accelerates network convergence and improves model accuracy. The R-CIoU proposed by the invention builds on this and, targeting the position sensitivity of small targets, improves the expression of the center-point distance. Its calculation formula is:
L_R-CIoU = 1 − IoU + ρ(b, b^gt)/c + αv

where ρ(b, b^gt) is the square root of the Euclidean distance between the center points of the prediction box and the ground-truth box, and c is the diagonal length of the minimum enclosing box of the prediction box and the ground-truth box;

v = (4/π²) · (arctan(w^gt/h^gt) − arctan(w/h))²

α = v / ((1 − IoU) + v)

(w, h) and (w^gt, h^gt) are the width and height of the prediction box and the ground-truth box, respectively; v is a penalty term measuring aspect-ratio consistency, so that as the prediction box rapidly approaches the target box its aspect ratio also approaches that of the target box, improving the accuracy of the model's predicted position.
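A sketch of a loss of this kind is given below; since the exact R-CIoU expression appears only as an image in the original publication, the square-rooted center-distance term is an assumption based on the surrounding text, and the v and alpha terms follow standard CIoU:

import math
import torch

def r_ciou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """R-CIoU sketch for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # IoU of prediction and ground-truth boxes
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)
    # rho: square root of the Euclidean center distance (assumption, see text);
    # c_diag: diagonal of the minimum enclosing box of the two boxes
    ctr_p = (pred[:, :2] + pred[:, 2:]) / 2
    ctr_t = (target[:, :2] + target[:, 2:]) / 2
    rho = torch.sqrt(torch.linalg.norm(ctr_p - ctr_t, dim=1) + 1e-7)
    enc_wh = torch.max(pred[:, 2:], target[:, 2:]) - torch.min(pred[:, :2], target[:, :2])
    c_diag = torch.linalg.norm(enc_wh, dim=1)
    # aspect-ratio consistency penalty v and trade-off weight alpha (standard CIoU)
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + 1e-7))
                              - torch.atan(w_p / (h_p + 1e-7))) ** 2
    alpha = v / ((1 - iou) + v + 1e-7)
    return 1 - iou + rho / (c_diag + 1e-7) + alpha * v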
Experimental setup: experiments were completed on a high-performance computing platform running Windows 10, with an AMD 5800X CPU, an RTX 3080 GPU and 32 GB of memory.
1. The model achieves high precision and high speed
The AECA-YOLO small target detection model is compared and analyzed against the existing mainstream target detection models (Faster R-CNN, SSD, CenterNet, YOLOv3 and YOLOv5) on the data set IFPS; the AP, AR and FPS of the different models are compared, verifying the efficiency of AECA-YOLO. The results of the comparative experiments are shown in Table 1:
table 1 shows the performance of the mainstream target detection model on IFPS
Figure BDA0003920269990000091
Figure BDA0003920269990000092
The proposed algorithm is superior to the mainstream target detection algorithms in detection precision, recall rate and recognition speed.
2. The coordinate attention mechanism and context information enhance model performance
To further demonstrate the optimization effect of each improvement in the AECA-YOLO algorithm on the detection model, an ablation experiment was carried out by gradually adding the improvements on the basis of YOLOv5; the experimental results are compared in Table 2:
table 2 shows the AECA-YOLO ablation experiment
Figure BDA0003920269990000101
Analysis of Table 2 shows that after context information is added to the fishing-pile labels, the model's missed-detection rate drops significantly, with recall improving by 16.22% and mAP by 27.64%; adding the coordinate attention mechanism at the end of the backbone network strengthens the network's channel feature extraction and effectively reduces background-noise interference, slightly improving both precision and recall; after the new decoupled detection head is adopted, the network's localization regression improves markedly, recall rises by 2.4%, convergence accelerates and training becomes more stable, though precision drops slightly; after the attention-fused sub-pixel upsampling module is introduced into the pyramid structure of the neck network, precision rises to 91.11%; finally, optimizing the loss function improves position sensitivity to the fishing-pile target, bringing the model's precision and recall to their highest values.
FIGS. 6 to 11 show, in order, the distribution of detection boxes in the final iteration for the six models of the ablation experiment; it can be seen intuitively that the model reduces background-noise interference, focuses more on small targets, and greatly reduces the false-detection and missed-detection rates.
With the five improvements combined, the optimized AECA-YOLO model proposed by the method, compared with the original YOLO model, effectively improves average detection precision by 29.7% and recall by 18.9% while essentially retaining the original detection speed; its FPS of 52.37 meets the real-time requirement of a frame rate above 25, making the improved AECA-YOLO algorithm more practical.
The above description covers only preferred embodiments of the invention, but the protection scope of the invention is not limited thereto; any equivalent substitution or change made by a person skilled in the art within the technical scope disclosed by the invention, according to the technical solutions and inventive concept thereof, shall fall within the protection scope of the invention.

Claims (9)

1. A small target detection method based on attention and context awareness, characterized by comprising the following steps:
S1, sample collection: provide a video collection module and collect sample video;
S2, sample processing: perform primary screening and division on the acquired video data, extract video frames at equal intervals to obtain a data set, and divide the data set into a training set and a validation set at a ratio of 8:2;
S3, sample labeling: label the data set with LabelImg software according to the model's data-reading requirements, crop the images into 640 × 640 patches by windowing, and name the resulting data set IFPS;
and S4, model analysis: provide an AECA-YOLO target detection model and input the data set IFPS into the model for training.
2. The small target detection method based on attention and context awareness according to claim 1, wherein the network structure of the AECA-YOLO target detection model comprises a backbone network, a neck network and a head network.
3. The small target detection method based on attention and context awareness according to claim 2, wherein a coordinate attention module is added before the SPPF module of the backbone network, a residual module is arranged in the coordinate attention module, channel attention is decomposed into the horizontal (X) and vertical (Y) directions, features in the two directions are encoded by average pooling, the channels are then compressed in the spatial dimension by concatenation and convolution operations, spatial information in the horizontal and vertical directions is encoded by batch normalization and nonlinear regression to obtain an attention map having features of both the spatial and channel dimensions, and finally the spatial information passed through a Sigmoid activation function is fused by channel-wise weighting.
4. The small target detection method based on attention and context awareness according to claim 2, wherein an upsampling module is arranged in the neck network, and the upsampling module comprises an attention module and a sub-pixel sampling module.
5. The small target detection method based on attention and context awareness according to claim 4, wherein, for a deep feature map x of shape H × W × C, the upsampling module fuses spatial information through the coordinate attention module to obtain a feature map x′ of the same shape.
6. The small target detection method based on attention and context awareness according to claim 4, wherein, for a given upsampling rate σ, the sub-pixel upsampling module relates any point l′ = (i′, j′) in the output feature map x″ to the corresponding point l = (i, j) in the input feature map x by

l = (i, j) = (⌊i′/σ⌋, ⌊j′/σ⌋).
7. The small target detection method based on attention and context awareness according to claim 6, wherein the sub-pixel upsampling module compresses the number of channels of the feature map to C_m by a 1×1 convolution, a k_en × k_en convolution then generates a position-aware upsampling kernel from the input feature map x′, and the kernel is expanded along the channel dimension to obtain an upsampling kernel ω_l′ of size σH × σW × k_up²; Softmax normalization is performed on the upsampling kernel so that the channel weights of any l′ sum to 1.
8. The small target detection method based on attention and context awareness according to claim 7, wherein the sub-pixel upsampling module computes the dot product of the k_up × k_up neighborhood of the original feature map x centered at point l with the predicted upsampling kernel ω_l′ to obtain the final output high-resolution feature map x″:

x″_l′ = Σ_{n=−r}^{r} Σ_{m=−r}^{r} ω_l′(n, m) · x(i + n, j + m), where r = ⌊k_up/2⌋;

different channels at the same location l share the same upsampling kernel.
9. The small target detection method based on attention and context awareness according to claim 3, wherein the head network adopts a decoupled detection head in place of a coupled head, a CBL module is arranged in the decoupled detection head, the decoupled detection head reduces the number of channels of the neck network output feature map to 256 through a 1×1 CBL module, and then uses two parallel branches, each passing through 2 CBL convolutional layers, to form a classification detection head and a regression detection head.
CN202211354054.9A 2022-11-01 2022-11-01 Small target detection method based on attention and context awareness Pending CN115937736A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211354054.9A CN115937736A (en) 2022-11-01 2022-11-01 Small target detection method based on attention and context awareness


Publications (1)

Publication Number Publication Date
CN115937736A true CN115937736A (en) 2023-04-07

Family

ID=86699726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211354054.9A Pending CN115937736A (en) 2022-11-01 2022-11-01 Small target detection method based on attention and context awareness

Country Status (1)

Country Link
CN (1) CN115937736A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311082A (en) * 2023-05-15 2023-06-23 广东电网有限责任公司湛江供电局 Wearing detection method and system based on matching of key parts and images
CN116311082B (en) * 2023-05-15 2023-08-01 广东电网有限责任公司湛江供电局 Wearing detection method and system based on matching of key parts and images
CN117952985A (en) * 2024-03-27 2024-04-30 江西师范大学 Image data processing method based on lifting information multiplexing under defect detection scene

Similar Documents

Publication Publication Date Title
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN111126202A (en) Optical remote sensing image target detection method based on void feature pyramid network
CN112215128B (en) FCOS-fused R-CNN urban road environment recognition method and device
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN105654067A (en) Vehicle detection method and device
CN115937736A (en) Small target detection method based on attention and context awareness
CN112070727B (en) Metal surface defect detection method based on machine learning
CN114841244B (en) Target detection method based on robust sampling and mixed attention pyramid
CN112819748B (en) Training method and device for strip steel surface defect recognition model
CN115861772A (en) Multi-scale single-stage target detection method based on RetinaNet
CN116229452B (en) Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
CN115049640B (en) Road crack detection method based on deep learning
CN114049572A (en) Detection method for identifying small target
CN116824543A (en) Automatic driving target detection method based on OD-YOLO
CN117496384A (en) Unmanned aerial vehicle image object detection method
CN116597411A (en) Method and system for identifying traffic sign by unmanned vehicle in extreme weather
CN110909656B (en) Pedestrian detection method and system integrating radar and camera
CN115019201A (en) Weak and small target detection method based on feature refined depth network
CN116503677B (en) Wetland classification information extraction method, system, electronic equipment and storage medium
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN112084941A (en) Target detection and identification method based on remote sensing image
Zhang et al. Pavement crack detection based on deep learning
CN117197687A (en) Unmanned aerial vehicle aerial photography-oriented detection method for dense small targets
CN110889418A (en) Gas contour identification method
CN116051808A (en) YOLOv 5-based lightweight part identification and positioning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination