CN118015388A - Small target detection method, device and storage medium - Google Patents

Small target detection method, device and storage medium

Info

Publication number
CN118015388A
CN118015388A
Authority
CN
China
Prior art keywords
small target
cluster
frame
picture
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410424489.9A
Other languages
Chinese (zh)
Inventor
覃仁超
张岚
叶承卓
何飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University of Science and Technology
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology filed Critical Southwest University of Science and Technology
Priority to CN202410424489.9A priority Critical patent/CN118015388A/en
Publication of CN118015388A publication Critical patent/CN118015388A/en
Pending legal-status Critical Current


Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a small target detection method, device, and storage medium, relating to the technical field of image detection, comprising the following steps: acquiring a picture to be detected containing a small target, and inputting the picture to be detected into a small target detection model; the model comprises a feature extraction network with an added attention mechanism, and a region proposal network with preset anchor frame sizes and ratios and a loss function with newly added area-ratio and aspect-ratio penalty terms; extracting a feature map containing small target information from the picture to be detected through the feature extraction network; outputting a plurality of candidate frames based on the feature map through the region proposal network, and filtering them to obtain prediction frames; and obtaining target probability scores and bounding-box regression scores through the region-of-interest layer and the fully connected layer, so as to obtain the small target detection frames and classes corresponding to the picture to be detected. The application improves the original small target detection model, raising the detection precision of the model and reducing the missed detection rate.

Description

Small target detection method, device and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and apparatus for detecting a small target, and a storage medium.
Background
In recent years, the technology of combining deep learning with target detection has matured, but its effect on small target detection remains unsatisfactory. Small target detection is now involved in many fields, such as defect detection, aerial image analysis, and intelligent medical treatment, where it plays an important role. Because small targets occupy few pixels, carry little semantic information, and are easily disturbed by noise, small target detection remains a major difficulty in the target detection field. Faster R-CNN, a classical two-stage target detection algorithm, extracts the target's feature information in a backbone network, generates candidate frames in an RPN, and finally synthesizes the proposals and feature maps through an ROI Pooling layer before sending them into a fully connected layer to judge the target class. Although this method improves the speed and precision of target detection to a certain extent, it cannot attend to smaller local features, and the generic sizes and ratios of its anchor frames are not guaranteed to suit a small target dataset, leading to problems such as low detection precision and a high missed detection rate.
Disclosure of Invention
The application aims to provide a small target detection method, device, and storage medium that improve the original small target detection model: the extraction precision of small local features is improved through a feature extraction network with an added MS-ECA attention mechanism; cluster analysis ensures that the anchor frame sizes and ratios suit the small target dataset; and, combined with a loss function taking the area ratio and the aspect ratio as new penalty terms, the detection precision of the model is comprehensively improved and the missed detection rate is reduced.
In a first aspect, the present application provides a method for detecting a small target, the method comprising: acquiring a picture to be detected containing a small target; inputting the picture to be detected into a small target detection model; the small target detection model comprises a feature extraction network added with an MS-ECA attention mechanism, and comprises a region suggestion network for presetting anchor frame sizes and proportions and adding a loss function of an area ratio penalty term and an aspect ratio penalty term; the size and the proportion of the preset anchor frame are obtained by carrying out cluster analysis on the small target training sample set through a clustering algorithm; extracting a feature image containing small target information from the image to be detected through a feature extraction network; outputting a plurality of candidate frames based on the extracted feature map through a regional suggestion network, and filtering the plurality of candidate frames to obtain a prediction frame for calculation; and adjusting the prediction frame through the region-of-interest layer, and obtaining a target probability score and a bounding box regression score after the prediction frame passes through the full-connection layer so as to obtain a small target detection frame and a class corresponding to the picture to be detected.
Further, the step of extracting, through the feature extraction network, a feature map containing small target information in the picture to be detected includes: extracting a first feature map of a picture to be detected through a backbone network in a feature extraction network; dividing the first feature map into a plurality of groups of feature maps according to the channel, and respectively convolving the plurality of groups of feature maps by convolution cores with different sizes to obtain a plurality of groups of second feature maps; inputting a plurality of groups of second feature images obtained after convolution into an ECA attention mechanism to obtain different weights, and multiplying each weight with the corresponding second feature image to obtain third feature images of different groups; and splicing the third feature images of different groups on the channel to obtain the feature images containing the small target information.
Further, the training process of the small target detection model is as follows: acquiring a small target training sample set; performing cluster analysis on the small target training sample set by adopting a K-means++ clustering algorithm, and determining the size and proportion of an anchor frame suitable for the training set; inputting samples in the small target training sample set into an initial detection model corresponding to the small target detection model; extracting sample characteristics through a characteristic extraction network added with an MS-ECA attention mechanism in an initial detection model; processing the sample characteristics through a regional suggestion network configured with the size and the proportion of the anchor frame to obtain a prediction frame; determining a loss value based on a loss function including an area ratio and an aspect ratio of the prediction frame and the real frame; and adjusting parameters of the initial detection model based on the loss value until the initial detection model converges to obtain a small target detection model.
Further, the step of obtaining the small target training sample set includes: acquiring a small target data set; performing scaling treatment and picture data enhancement treatment on pictures in the small target data set to obtain a small target training sample set; wherein the scaling process includes: scaling the picture with the real frame width and height value of the small target exceeding the threshold value according to the scaling ratio; the scaling ratio is equal to the minimum value in the target real frame width and height values divided by a preset value; the picture data enhancement processing includes: and (5) rotating, overturning and adjusting HSV values of each picture.
Further, the step of performing cluster analysis on the small target training sample set using the K-means++ clustering algorithm to determine the anchor frame sizes and ratios suited to the training set comprises the following steps: acquiring the area of the real frame corresponding to the small target in each sample of the training set; judging all sample pairs by the rule that two real frames belong to one cluster if the area difference of the real frames in the two samples is less than or equal to a preset area threshold, and otherwise do not belong to one cluster, thereby determining the number of clusters; performing cluster-center initialization on the samples in the training set to obtain as many initialized cluster centers as there are clusters; updating the plurality of initialized cluster centers to obtain a plurality of updated cluster centers; and determining the anchor frame sizes and ratios suited to the training set based on the target real frames respectively corresponding to the plurality of updated cluster centers.
Further, the step of initializing the cluster centers for the samples in the training set to obtain the initialized cluster centers with the same number as the number of clusters includes: randomly selecting a sample from the training set as the center of the current cluster; a cluster center determining step is carried out based on the current cluster center: calculating IoU values between each sample and the current cluster center, taking the sample corresponding to the minimum IoU value as a new current cluster center, continuing to execute the cluster center determining step until the number of the determined current cluster centers reaches the number of clusters, and determining the determined current cluster centers as initialized cluster centers.
Further, the step of updating the plurality of initialized cluster centers to obtain a plurality of updated cluster centers includes: performing a sample-partitioning step based on the plurality of initialized cluster centers: for each sample, calculating the IoU values of the sample with each initialized cluster center, and attributing the sample to the cluster whose initialized cluster center gives the maximum IoU value; updating the cluster center in each cluster to the median value in the cluster, i.e. the sample whose IoU value with the cluster center lies in the middle position among the IoU values of all samples in the cluster; and continuing the sample-partitioning step based on the updated cluster centers until the samples in each cluster no longer change, at which point the cluster center of each cluster is determined as the updated cluster center.
Further, the step of determining the loss value based on the loss function including the area ratio and the aspect ratio of the prediction frame and the real frame includes: calculating the loss value AWH-IoU according to the following equation:
wherein IoU denotes the area intersection-over-union between the prediction frame and the real frame in the output result; (x_p, y_p) and (x_gt, y_gt) denote the center coordinate values of the prediction frame and the real frame respectively; ρ denotes the center distance of the prediction frame and the real frame; c denotes the diagonal distance of the smallest closed region that can enclose both the prediction frame and the real frame; S_p and S_gt denote the areas of the prediction frame and the real frame respectively; (w_gt, h_gt) and (w_p, h_p) denote the width and height values of the real frame and the prediction frame respectively; and e denotes the exponential function.
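As a non-authoritative illustration, the symbols defined above can be assembled into a runnable sketch. The patent's equation itself is not reproduced in this text, so the sketch below assumes a DIoU-style base (1 − IoU plus the normalized center-distance term) and assumes the area-ratio and aspect-ratio penalties take an exponential form that vanishes when the two frames match; only the symbol definitions, not these exact forms, come from the source.

```python
import math

def awh_iou_loss(pred, gt, alpha=1.0, beta=1.0):
    """Sketch of an AWH-IoU-style loss; pred/gt are boxes (x1, y1, x2, y2).

    The base terms follow the symbols the patent defines (IoU, center
    distance rho, enclosing-box diagonal c, areas S_p/S_gt, widths/heights);
    the exponential penalty forms are ASSUMED, not taken from the patent.
    """
    # areas S_p, S_gt
    sp = (pred[2] - pred[0]) * (pred[3] - pred[1])
    sg = (gt[2] - gt[0]) * (gt[3] - gt[1])
    # intersection and IoU
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    iou = inter / (sp + sg - inter)
    # squared center distance rho^2 and enclosing-box diagonal c^2
    cxp, cyp = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cxg, cyg = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = (cxp - cxg) ** 2 + (cyp - cyg) ** 2
    ex1, ey1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    ex2, ey2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # assumed penalty terms: both are 0 when areas / aspect ratios agree
    p_area = math.exp(abs(sp / sg - 1.0)) - 1.0
    wp, hp = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    p_wh = math.exp(abs(wp / hp - wg / hg)) - 1.0
    dist = rho2 / c2 if c2 > 0 else 0.0
    return 1.0 - iou + dist + alpha * p_area + beta * p_wh
```

With identical frames every term vanishes and the loss is 0; a frame with the wrong area or aspect ratio is penalized more heavily than under plain IoU, which is the stated intent of the added terms.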
In a second aspect, the present application also provides a small object detection apparatus, the apparatus comprising a plurality of modules for performing the steps of the small object detection method according to any one of the first aspect, the plurality of modules comprising a picture acquisition module, a picture input module, a feature extraction module and a prediction module, wherein: the image acquisition module is used for acquiring an image to be detected containing a small target; the picture input module is used for inputting a picture to be detected into the small target detection model; the small target detection model comprises a feature extraction network added with an MS-ECA attention mechanism, and comprises a region suggestion network for presetting anchor frame sizes and proportions and adding a loss function of an area ratio penalty term and an aspect ratio penalty term; the size and the proportion of the preset anchor frame are obtained by carrying out cluster analysis on the small target training sample set through a clustering algorithm; the feature extraction module is used for extracting a feature image containing small target information in the image to be detected through a feature extraction network; the prediction module is used for outputting a plurality of candidate frames based on the extracted feature images through the regional suggestion network, and filtering the candidate frames to obtain a prediction frame for calculation; and adjusting the prediction frame through the region-of-interest layer, and obtaining a target probability score and a bounding box regression score after the prediction frame passes through the full-connection layer so as to obtain a small target detection frame and a class corresponding to the picture to be detected.
In a third aspect, the present application also provides a computer readable storage medium storing computer executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of the first aspect.
In the small target detection method, device, and storage medium provided by the application, a picture to be detected containing a small target is first acquired and input into a small target detection model; the small target detection model comprises a feature extraction network with an added MS-ECA attention mechanism, and a region proposal network comprising preset anchor frame sizes and ratios and a loss function with newly added area-ratio and aspect-ratio penalty terms; the preset anchor frame sizes and ratios are obtained by performing cluster analysis on the small target training sample set with a clustering algorithm; a feature map containing small target information is then extracted from the picture to be detected through the feature extraction network; a plurality of candidate frames are output based on the extracted feature map through the region proposal network and filtered to obtain prediction frames for calculation; and the prediction frames are adjusted through the region-of-interest layer, after which target probability scores and bounding-box regression scores are obtained through the fully connected layer, so as to obtain the small target detection frames and classes corresponding to the picture to be detected. The application improves the original small target detection model: the feature extraction network with the added MS-ECA attention mechanism raises the extraction precision of small local features, the anchor frame sizes and ratios determined by cluster analysis suit the small target dataset, and, combined with the loss function taking the area ratio and the aspect ratio as new penalty terms, the detection precision of the model is comprehensively improved and the missed detection rate is reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a small target detection method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a Faster R-CNN according to an embodiment of the present application;
FIG. 3 is a schematic diagram of ResNet-50 according to an embodiment of the present application;
Fig. 4 is a schematic diagram of an MS-ECA attention mechanism according to an embodiment of the present application;
FIG. 5 is a flowchart of a K-means++ clustering algorithm provided by an embodiment of the present application;
FIG. 6 is a comparative schematic of the results provided in the examples of the present application;
fig. 7 is a block diagram of a small target detection device according to an embodiment of the present application.
Detailed Description
The technical solutions of the present application will be clearly and completely described in connection with the embodiments, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
At present, because small target features are difficult to extract, attention mechanisms are widely applied in the target detection field as plug-and-play structures. The currently popular attention mechanisms, such as SE, CBAM, and ECA, mainly amplify and enhance the features of the input feature map at a single scale, and the feature-enhancement effect of this operation on small targets is not obvious. In addition, methods have recently appeared that use a generative adversarial network to acquire high-resolution images; although these enhance the information of small objects in the image, they greatly increase the burden on the detection network and are therefore not suitable for the two-stage detection method. Secondly, the proposal box is the most intuitive representation of the detection effect. In the candidate-frame screening stage, the most applied loss functions are IoU, G-IoU, D-IoU, C-IoU, and the like; to a certain extent these loss functions filter the candidate frames of medium and large targets well, but they do not work well for small targets, particularly small targets with a pronounced aspect ratio.
In summary, the existing two-stage small target detection method mainly faces the following problems:
1. the design of the anchor frame sizes and ratios is not suited to small targets;
2. the feature extraction part cannot extract rich small target information;
3. in the candidate-frame screening part, because the penalty strength of the loss function is too small, candidate frames of poor quality can still serve as prediction frames and participate in subsequent operations.
Based on the above, the embodiments of the application provide a small target detection method, device, and storage medium that improve the original small target detection model: the extraction precision of small local features is improved through a feature extraction network with an added MS-ECA attention mechanism; cluster analysis ensures that the anchor frame sizes and ratios suit the small target dataset; and, combined with a loss function taking the area ratio and the aspect ratio as new penalty terms, the detection precision of the model is comprehensively improved and the missed detection rate is reduced.
For the sake of understanding the present embodiment, a detailed description will be given of a small target detection method disclosed in the embodiment of the present application.
Fig. 1 is a flowchart of a small target detection method according to an embodiment of the present application, where the method includes the following steps:
Step S102, acquiring a picture to be detected containing a small target; the small target in the picture to be detected may be of different types, such as cancer cells or license plates; that is, the small target detection model provided by the application may be a cancer cell detection model or a license plate detection model.
Step S104, inputting the picture to be detected into a small target detection model; the small target detection model comprises a feature extraction network with an added MS-ECA (Multi-Scale Split Efficient Channel Attention) mechanism, and a region proposal network comprising preset anchor frame sizes and ratios and a loss function with newly added area-ratio and aspect-ratio penalty terms; the preset anchor frame sizes and ratios are obtained by performing cluster analysis on the small target training sample set with a clustering algorithm.
The small target detection model in the embodiment of the application is obtained by improving the model based on the Faster R-CNN (the original structure of which is shown in figure 2), and an MS-ECA attention mechanism is added in a feature extraction network, so that the extracted small target information can be enriched; in the loss function used in the regional suggestion network, adding a new penalty term composed of the area ratio and the aspect ratio of the candidate frame and the real frame; processing the candidate frames based on the preset anchor frame size and the preset anchor frame proportion during filtering, wherein the preset anchor frame size and the preset anchor frame proportion are obtained by carrying out cluster analysis on a small target training sample set through a clustering algorithm; therefore, the method can play a good effect aiming at a small target with obvious aspect ratio and improve model prediction accuracy.
And S106, extracting a feature map containing small target information from the picture to be detected through a feature extraction network.
Feature extraction is performed through a feature extraction network added with an MS-ECA attention mechanism, so that a feature map containing richer small target information can be obtained.
Step S108, outputting a plurality of candidate frames based on the extracted feature images through a regional suggestion network, and filtering the plurality of candidate frames to obtain a prediction frame for calculation; and adjusting the prediction frame through the region-of-interest layer, and obtaining a target probability score and a bounding box regression score after the prediction frame passes through the full-connection layer so as to obtain a small target detection frame and a class corresponding to the picture to be detected.
In order to solve the problem of inaccurate detection of a small target due to low resolution of the small target in a main extraction network part, the method for detecting the small target provided by the embodiment of the application has the main idea that an attention mechanism capable of carrying out characteristic enhancement on multiple scales, namely an MS-ECA attention mechanism is designed and applied to a characteristic extraction part so as to acquire more abundant small target information; secondly, aiming at the candidate frame screening process, the embodiment provides a loss function which takes the area ratio and the length-width ratio between the candidate frame and the real frame as penalty items into consideration, so that the penalty strength of small targets, particularly small targets with obvious length-width ratios, when filtering the candidate frame is enhanced, and the omission ratio of the small targets is reduced; in addition, the size and the proportion of the anchor frame in the model training process are obtained based on the clustering algorithm analysis of the sample set, so that the model training method is more suitable for model training of small targets, and the model detection effect is improved.
The embodiment of the application also provides another small target detection method, which is realized on the basis of the embodiment; the present embodiment focuses on the feature extraction process and the model training process.
The step of extracting the feature map containing the small target information in the picture to be detected through the feature extraction network comprises the following steps:
Extracting a first feature map of a picture to be detected through a backbone network in a feature extraction network; dividing the first feature map into a plurality of groups of feature maps according to the channel, and respectively convolving the plurality of groups of feature maps by convolution cores with different sizes to obtain a plurality of groups of second feature maps; inputting a plurality of groups of second feature images obtained after convolution into an ECA attention mechanism to obtain different weights, and multiplying each weight with the corresponding second feature image to obtain third feature images of different groups; and splicing the third feature images of different groups on the channel to obtain the feature images containing the small target information.
In the embodiment of the application, the backbone network is a ResNet-50 feature extraction network which, as shown in fig. 3, consists of one convolution layer and four residual layers; an MS-ECA attention mechanism is added after the first residual block and the second residual block of ResNet-50, enriching the extracted small target feature information.
In implementation, referring to the schematic diagram of the MS-ECA attention mechanism shown in fig. 4: the input feature map is first divided into several groups along the channel dimension; the divided feature maps are then convolved with convolution kernels of different sizes, with the padding value increased for the larger convolution kernels so that the final output size stays consistent with the input size; the feature maps obtained after convolution are input into an ECA attention mechanism to obtain different weights, each weight is multiplied with its input feature map, and finally the feature maps of the different groups are spliced along the channel and output, so that the network focuses more on the information of small targets.
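The grouping-then-reweighting flow above can be sketched in plain Python. This is a heavily simplified illustration, not the patent's implementation: the multi-scale convolutions are omitted (noted in a comment), and the ECA step, which in the real mechanism applies a 1-D convolution across channel descriptors, is reduced here to a per-channel global-average-pool followed by a sigmoid weight.

```python
import math

def ms_eca(feature, groups=2):
    """Simplified MS-ECA sketch.

    `feature` is a feature map as a list of channels, each channel a 2-D
    list.  Channels are split into `groups` groups; each group would be
    convolved with a different kernel size in the real mechanism (omitted
    here); each channel is then reweighted by a sigmoid of its global
    average (a stand-in for the ECA attention weight); finally the groups
    are spliced back together along the channel axis.
    """
    c = len(feature)
    size = c // groups
    out = []
    for g in range(groups):
        group = feature[g * size:(g + 1) * size]
        # NOTE: the per-group multi-scale convolution (with padding chosen
        # so H and W are unchanged) is omitted in this sketch.
        for ch in group:
            gap = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            w = 1.0 / (1.0 + math.exp(-gap))  # sigmoid attention weight
            out.append([[v * w for v in row] for row in ch])
    return out  # same channel count and spatial size as the input
```

The key property preserved from the description is that the output has the same shape as the input, so the module can be dropped after any residual layer without disturbing the rest of the backbone.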
The training process of the small target detection model in the embodiment of the application is described in detail as follows:
(1) Acquiring a small target training sample set;
Firstly, acquiring a small target data set; then performing scaling processing and picture data enhancement processing on pictures in the small target data set to obtain a small target training sample set; wherein the scaling process includes: scaling the picture with the real frame width and height value of the small target exceeding the threshold value according to the scaling ratio; the scaling ratio is equal to the minimum value in the target real frame width and height values divided by a preset value; the picture data enhancement processing includes: and (5) rotating, overturning and adjusting HSV values of each picture.
The public dataset FlickrLogos-32 is preprocessed. According to the definition of a small target, the target width and height values in the annotation file corresponding to each picture are read; if the width value or the height value is larger than 32 px, the whole picture is scaled, the upper-left corner of the scaled image is aligned with the upper-left corner of the original image, and the rest is filled with black to form a new image for the training set. Data enhancement is then performed on each new image by rotating, flipping, adjusting the HSV value of the image, and the like. The scaling-ratio formula is as follows:
ratio = min(w_t, h_t) / 32
wherein ratio denotes the scaling ratio; w_t and h_t denote the width and height values of each target respectively; min(w_t, h_t) denotes the minimum of the width and height values of each target; and 32 is the absolute-size definition of a small target.
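The scaling rule above can be sketched as follows. The helper `scaled_size` and its parameter names are illustrative, not from the patent; the top-left alignment and black fill of the resized image are described in the text but not shown here.

```python
def scaling_ratio(w, h, small=32):
    """ratio = min(w_t, h_t) / 32, the formula from the preprocessing step."""
    return min(w, h) / small

def scaled_size(img_w, img_h, w, h, small=32):
    """Illustrative helper (names assumed): shrink the whole picture by the
    scaling ratio when the target's width or height exceeds the 32 px
    small-target threshold; otherwise leave the picture unchanged."""
    if max(w, h) <= small:  # already within the small-target definition
        return img_w, img_h
    r = scaling_ratio(w, h, small)
    return int(img_w / r), int(img_h / r)
```

For example, a 640x480 picture whose target box is 64x64 has ratio 2 and is resized to 320x240, after which the target measures 32 px and qualifies as small.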
(2) Performing cluster analysis on the small target training sample set by adopting a K-means++ clustering algorithm, and determining the size and proportion of an anchor frame suitable for the training set;
the specific implementation process can be described with reference to the steps shown in fig. 5, and is implemented by the following steps:
A. Acquiring the area of the real frame corresponding to the small target in each sample of the training set; judging all sample pairs by the rule that two real frames belong to one cluster if the area difference of the real frames in the two samples is less than or equal to a preset area threshold, and otherwise do not belong to one cluster, thereby determining the number of clusters;
For example, the annotation file corresponding to each picture is first read and the real-frame width and height values of each target are obtained to facilitate subsequent calculation; an area threshold T is then set and used to determine the number of clusters, according to the specific formula:
f = 1 if |S_1 − S_2| ≤ T, otherwise f = 0
wherein S_1 and S_2 denote the areas of the two samples respectively; f = 0 indicates that the two samples being compared do not belong to one cluster; f = 1 indicates that the two samples belong to one cluster; and f indicates whether the two samples belong to the same cluster.
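The threshold rule and cluster counting can be sketched as below. The patent does not spell out the order in which sample pairs are compared, so the single greedy pass in `count_clusters` is an assumption of this sketch.

```python
def same_cluster(s1, s2, thr):
    """f = 1 when |S_1 - S_2| <= T (areas close enough -> same cluster)."""
    return 1 if abs(s1 - s2) <= thr else 0

def count_clusters(areas, thr):
    """Sketch (greedy pass assumed) of judging all samples by the rule
    above to determine the number of clusters: each area joins the first
    existing cluster whose representative area it matches, otherwise it
    opens a new cluster."""
    reps = []  # one representative area per cluster found so far
    for a in areas:
        if not any(same_cluster(a, r, thr) for r in reps):
            reps.append(a)
    return len(reps)
```

With threshold 5, the areas [10, 12, 40, 43, 100] collapse into three clusters ({10, 12}, {40, 43}, {100}), so three initialized cluster centers would be produced in the next step.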
B. Carrying out cluster center initialization processing on samples in the training set to obtain initialized cluster centers with the same number as the number of clusters;
randomly selecting a sample from the training set as the center of the current cluster; a cluster center determining step is carried out based on the current cluster center: calculating IoU values between each sample and the current cluster center, taking the sample corresponding to the minimum IoU value as a new current cluster center, continuing to execute the cluster center determining step until the number of the determined current cluster centers reaches the number of clusters, and determining the determined current cluster centers as initialized cluster centers.
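The initialization loop above can be sketched as follows. Two assumptions are made for illustration: the first center is fixed rather than randomly chosen (for repeatability), and boxes are compared by shape only via an origin-aligned IoU, the usual convention for anchor clustering.

```python
def wh_iou(a, b):
    """IoU of two boxes (w, h) compared by shape only, aligned at the
    origin -- an assumption of this sketch, common in anchor clustering."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def init_centers(boxes, k, iou_fn=wh_iou):
    """Per the description: starting from one sample as the current
    center, repeatedly take the sample with the SMALLEST IoU to the
    current center as the next center, until k centers are chosen."""
    centers = [boxes[0]]  # fixed first pick instead of random, for tests
    while len(centers) < k:
        cur = centers[-1]
        nxt = min((b for b in boxes if b not in centers),
                  key=lambda b: iou_fn(b, cur))
        centers.append(nxt)
    return centers
```

Picking the minimum-IoU sample spreads the initial centers across dissimilar box shapes, which is the point of the K-means++-style seeding described above.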
IoU values can be calculated as follows:

IoU = (S1 ∩ S2) / (S1 ∪ S2)

wherein ∩ represents the intersection, ∪ represents the union, and IoU is the area intersection-over-union ratio; (xmin, ymin) and (xmax, ymax) represent the upper-left and lower-right corner coordinates of the combined frame formed by the two real frames, and S1 and S2 represent the areas of the two real frames respectively.
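Step B can be sketched as follows (a minimal sketch: representing each real frame as a (width, height) pair aligned at a common corner, the usual convention for anchor clustering, is an assumption of this illustration):

```python
import random

def iou_wh(b1, b2):
    """IoU of two (width, height) boxes aligned at a common corner."""
    inter = min(b1[0], b2[0]) * min(b1[1], b2[1])
    return inter / (b1[0] * b1[1] + b2[0] * b2[1] - inter)

def init_centers(boxes, k, seed=0):
    """Pick one random box as the first center; then, as described above,
    repeatedly take the sample with the minimum IoU to the current center
    as the next center, until k centers have been chosen."""
    rng = random.Random(seed)
    centers = [rng.choice(boxes)]
    while len(centers) < k:
        current = centers[-1]
        candidates = [b for b in boxes if b not in centers]
        centers.append(min(candidates, key=lambda b: iou_wh(b, current)))
    return centers
```

The minimum-IoU choice plays the role of the farthest-point heuristic in ordinary K-means++, with 1 − IoU acting as the distance.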
C. updating the plurality of initialized cluster centers to obtain a plurality of updated cluster centers;
In particular implementations, the sample partitioning step may be performed based on the plurality of initialized cluster centers: for each sample, IoU values between the sample and each initialized cluster center are calculated, and the sample is assigned to the cluster whose center yields the maximum IoU value; the center of each cluster is then updated to the cluster's median sample, i.e. the sample whose IoU value with the cluster center lies in the middle of the IoU values of all samples in that cluster; the sample partitioning step is repeated based on the updated cluster centers until the samples in each cluster no longer change, and the cluster center in each cluster at this point is determined as the updated cluster center.
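A sketch of step C under the same (width, height) box convention (the choice of tie-breaking at the median index is an assumption):

```python
def iou_wh(b1, b2):
    """IoU of two (width, height) boxes aligned at a common corner."""
    inter = min(b1[0], b2[0]) * min(b1[1], b2[1])
    return inter / (b1[0] * b1[1] + b2[0] * b2[1] - inter)

def update_centers(boxes, centers, max_iter=100):
    """Assign every box to the center giving the maximum IoU, then replace
    each center by its cluster's median-IoU member; repeat until the
    assignment stops changing, as described in step C."""
    assignment = None
    for _ in range(max_iter):
        new_assignment = [max(range(len(centers)),
                              key=lambda c: iou_wh(b, centers[c]))
                          for b in boxes]
        if new_assignment == assignment:
            break  # samples in each cluster no longer change
        assignment = new_assignment
        for c in range(len(centers)):
            members = [b for b, a in zip(boxes, assignment) if a == c]
            if members:
                # member whose IoU with the center is the median value
                members.sort(key=lambda b: iou_wh(b, centers[c]))
                centers[c] = members[len(members) // 2]
    return centers
```

Using the median member rather than the mean makes the update robust to outlier box sizes, which matters when a few large boxes remain in a small-target set.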
D. The size and proportion of the anchor frames applicable to the training set are determined based on the target real frames corresponding to the plurality of updated cluster centers. For example, if 9 cluster centers are finally obtained, those 9 centers constitute the anchor frame information, e.g. (width, height): (5, 5), (15, 15), (20, 20), (10, 10), (30, 30), (40, 40), (20, 20), (60, 60), (90, 90); the final sizes and proportions may then be sizes (10, 10), (30, 30), (40, 40) and ratios (1:1), (1:2), (2:1). After the clustering result is analyzed, the final sizes and ratios are chosen so as to cover all cluster centers as far as possible. In the embodiment of the application, any count may be selected, subject to the constraint that the number of cluster centers equals the number of sizes multiplied by the number of ratios: with 12 cluster centers, the number of sizes may be 1, 2, 3, 4, 6, or 12, with the corresponding number of ratios being 12, 6, 4, 3, 2, or 1.
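The constraint at the end of step D — number of cluster centers = number of sizes × number of ratios — can be enumerated directly:

```python
def size_ratio_options(num_centers):
    """All (size count, ratio count) pairs whose product equals the
    number of cluster centers, matching the 12-center example above."""
    return [(s, num_centers // s)
            for s in range(1, num_centers + 1) if num_centers % s == 0]
```

With 12 centers this reproduces the size counts 1, 2, 3, 4, 6, 12 paired with ratio counts 12, 6, 4, 3, 2, 1.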
(3) Inputting samples in the small target training sample set into an initial detection model corresponding to the small target detection model; the initial detection model is Faster R-CNN with improved structure.
(4) Extracting sample characteristics through a characteristic extraction network added with an MS-ECA attention mechanism in an initial detection model; processing the sample characteristics through a regional suggestion network configured with the size and the proportion of the anchor frame to obtain a prediction frame; determining a loss value based on a loss function including an area ratio and an aspect ratio of the prediction frame and the real frame; i.e. the loss value is calculated from the candidate box and the real box. The specific calculation process is as follows:
The loss value AWH-IoU is calculated according to the following equation:
wherein IoU represents the area intersection-over-union ratio between the candidate frame and the real frame in the output result; (x1, y1) and (x2, y2) represent the center coordinates of the candidate frame and the real frame respectively; ρ represents the center distance between the candidate frame and the real frame; c represents the diagonal distance of the minimum closed region that can enclose the candidate and real frames; S1 and S2 represent the areas of the candidate and real frames respectively; w1, h1 and w2, h2 represent the width and height values of the real frame and the candidate frame respectively; and exp represents the exponential function.
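The equation itself does not survive in this text, so the following is only an illustrative reconstruction assembled from the symbols described above (IoU, center distance, enclosing diagonal, areas, widths/heights, and an exponential): a DIoU-style center-distance term plus exponential area-ratio and aspect-ratio penalties. The actual AWH-IoU formula in the patent may differ.

```python
import math

def awh_iou(a, b):
    """Hypothetical AWH-IoU-style score for (x1, y1, x2, y2) boxes:
    IoU minus a center-distance term and exponential penalties on the
    area and aspect-ratio mismatch. Illustrative only."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    sa = (ax2 - ax1) * (ay2 - ay1)
    sb = (bx2 - bx1) * (by2 - by1)
    iou = inter / (sa + sb - inter)
    # squared center distance over squared enclosing diagonal (DIoU-style)
    rho2 = ((ax1 + ax2 - bx1 - bx2) / 2) ** 2 + ((ay1 + ay2 - by1 - by2) / 2) ** 2
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2
    # exponential penalties on area mismatch and aspect-ratio mismatch
    area_pen = 1 - math.exp(-abs(sa - sb) / max(sa, sb))
    ar_a = (ax2 - ax1) / (ay2 - ay1)
    ar_b = (bx2 - bx1) / (by2 - by1)
    aspect_pen = 1 - math.exp(-abs(ar_a - ar_b) / max(ar_a, ar_b))
    return iou - rho2 / c2 - area_pen - aspect_pen
```

Identical boxes score 1.0; boxes with mismatched areas or aspect ratios are pushed below plain IoU, which is the stated intent of the extra penalty terms.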
(5) And adjusting parameters of the initial detection model based on the loss value until the initial detection model converges to obtain a small target detection model.
The above process is a model training process, and when the model is applied, a plurality of candidate frames can be output based on the extracted target features through the regional suggestion network, and the candidate frames are filtered to obtain a prediction frame for calculation; and adjusting the prediction frame through the region-of-interest layer, and obtaining a target probability score and a bounding box regression score after the prediction frame passes through the full-connection layer so as to obtain a small target detection frame and a class corresponding to the picture to be detected.
The specific process is as follows:
The RPN network (the region proposal network in fig. 2) filters the candidate boxes by using NMS (Non-Maximum Suppression). Standard NMS uses the IoU value between anchors as the filtering condition; the embodiment of the present application further introduces the area ratio and the aspect ratio between the candidate boxes and the real boxes as new penalty terms on top of this condition. First, a group of candidate frames is generated by the detection model, each with its coordinate information and confidence score; the candidate frames are sorted in descending order of confidence, and the AWH_IoU values of all candidate frames are calculated. The candidate boxes are then sorted in descending order by AWH_IoU value; the candidate box with the highest AWH_IoU value is selected and added to the final result list. The remaining candidate boxes are traversed, and the AWH_IoU value between each remaining candidate box and the selected candidate boxes is calculated: if the AWH_IoU value between a remaining candidate box and any selected candidate box is greater than a set threshold (0.5), that candidate box is discarded; otherwise it is added to the final list. This process is repeated until all candidate boxes have been processed, and the candidate frames contained in the final result list are the output prediction frames. The prediction frames are then adjusted through the region-of-interest layer, and the target probability score and bounding-box regression score are obtained after the full-connection layer, so as to obtain the small target detection frame and class corresponding to the picture to be detected.
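The filtering loop above is greedy NMS with an AWH-style score in place of plain IoU. A minimal sketch, using plain IoU as the pairwise score for illustration (any IoU-like scoring function, such as the AWH variant, can be passed in):

```python
def iou_xyxy(a, b):
    """Plain IoU for (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    a_area = (a[2] - a[0]) * (a[3] - a[1])
    b_area = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (a_area + b_area - inter)

def nms(boxes, scores, pair_score, threshold=0.5):
    """Greedy suppression as described above: visit boxes in descending
    confidence order; keep a box only if its pairwise score with every
    already-kept box does not exceed the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(pair_score(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep
```

For example, two heavily overlapping boxes collapse to the higher-scored one, while a distant box survives.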
In the small target detection method provided by the embodiment of the application, the training set is clustered through K-means++ with different numbers of cluster centers. The clustering results show that as the number of clusters increases, the average IoU value against the real frames also increases continuously, meaning that a larger K value better reflects the distribution of the samples when the number of cluster centers is set. With the K value set to 12, anchor frame information with four sizes and three different proportions is obtained; compared with the three sizes and three proportions obtained empirically in the original Faster R-CNN, the method provided by the embodiment of the application is better suited to anchor frame design for small targets.
In addition, by adding an MS_ECA attention mechanism after the first and second residual modules of ResNet-50, the feature extraction network can attend to small-target information more effectively, reducing the miss rate for small targets; and by using AWH_IoU, which applies a stronger penalty, the RPN takes the area ratio and aspect ratio between boxes into account when filtering candidate frames, so candidate frames with large IoU values but poor quality can be filtered out more effectively, improving the detection precision for small targets.
Small object detection is a challenging problem in the field of computer vision, typically involving identifying and locating objects of very small size in an image or video. These small targets may be difficult to detect accurately due to low resolution, occlusion, or noise. Existing small target detection methods typically rely on complex neural network architectures and large amounts of training data. Thus, there is a need for a more efficient, accurate and scalable small target detection system and method. The small target detection method based on the multi-scale attention mechanism provided by the embodiment of the application offers good scalability while improving detection accuracy. The main concepts of the method are as follows:
1. Small target enhancement technique: the embodiment of the application provides a technique for enhancing small target feature information, which enhances target features from low-resolution or blurred images within the backbone extraction network, so as to improve detection performance;
2. multi-scale detection: the small target detection model provided by the embodiment of the application can support multi-scale target detection so as to ensure that small targets can be accurately detected even under different distances and sizes;
3. Rich training samples: according to the embodiment of the application, the original data set is subjected to data enhancement by using methods such as overturning and rotating, so that the training set is expanded, and the robustness of the model is improved;
4. More efficient candidate box screening technique: in addition to considering the influence of IoU, the embodiment of the application adds the area ratio and the aspect ratio between the candidate frames and the real frames as new penalty terms, so that poorly fitting candidate frames can be filtered out more effectively during candidate box screening.
Feasibility and beneficial effects:
1. All experiments in the embodiment of the application were carried out under a Windows system, using PyTorch as the deep learning framework; the hardware environment is an NVIDIA GTX3050 GPU equipped with CUDA 10.2;
2. the embodiment of the application has higher detection accuracy and particularly has excellent performance in the aspect of small target detection;
3. The MS_ECA attention mechanism and the AWH_ IoU provided by the embodiment of the application can be suitable for various detection models, and have good expansibility;
Experimental results and analysis:
The improved detection model adopts a new neural network architecture with better feature extraction capability. The experimental result shown in fig. 6 shows that the detection precision of the improved algorithm on the enhanced FlickrLogos-32 data set reaches 92.7%; compared with the original Faster R-CNN, the MS_ECA attention mechanism and AWH_IoU effectively improve the detection precision for small targets.
Based on the above method embodiments, the present application further provides a small object detection apparatus comprising a plurality of modules for performing the steps of the small object detection method described above. As shown in fig. 7, the plurality of modules includes a picture acquisition module 82, a picture input module 84, a feature extraction module 86, and a prediction module 88, wherein: the picture acquisition module 82 is configured to acquire a picture to be detected containing a small target; the picture input module 84 is configured to input the picture to be detected into the small target detection model, which comprises a feature extraction network added with an MS-ECA attention mechanism, a region suggestion network with preset anchor frame sizes and proportions, and a loss function with newly added area ratio and aspect ratio penalty terms, wherein the preset anchor frame sizes and proportions are obtained by performing cluster analysis on a small target training sample set through a clustering algorithm; the feature extraction module 86 is configured to extract a feature map containing small target information from the picture to be detected through the feature extraction network; the prediction module 88 is configured to output a plurality of candidate frames based on the extracted feature map through the region suggestion network and filter the plurality of candidate frames to obtain prediction frames for calculation, adjust the prediction frames through the region-of-interest layer, and obtain a target probability score and a bounding-box regression score after the full-connection layer, so as to obtain the small target detection frame and class corresponding to the picture to be detected.
Further, the feature extraction module 86 is configured to extract a first feature map of the picture to be detected through a backbone network in the feature extraction network; dividing the first feature map into a plurality of groups of feature maps according to the channel, and respectively convolving the plurality of groups of feature maps by convolution cores with different sizes to obtain a plurality of groups of second feature maps; inputting a plurality of groups of second feature images obtained after convolution into an ECA attention mechanism to obtain different weights, and multiplying each weight with the corresponding second feature image to obtain third feature images of different groups; and splicing the third feature images of different groups on the channel to obtain the feature images containing the small target information.
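The channel-split, multi-kernel, ECA-weighted flow described above can be sketched in PyTorch as follows (module names, the group kernel sizes, and the ECA 1-D kernel size are assumptions of this sketch; the actual MS-ECA design may differ):

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Standard ECA-style channel attention: global average pooling,
    a 1-D convolution across channels, then a sigmoid gate."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        w = x.mean(dim=(2, 3))                    # (B, C) global average pool
        w = self.conv(w.unsqueeze(1)).squeeze(1)  # 1-D conv across channels
        return torch.sigmoid(w)[..., None, None]  # (B, C, 1, 1) weights

class MSECA(nn.Module):
    """Sketch of the MS-ECA block: split the first feature map into groups
    along the channel axis, convolve each group with a different kernel
    size, re-weight each group with ECA, and splice the groups back."""
    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        assert channels % len(kernel_sizes) == 0
        g = channels // len(kernel_sizes)
        self.convs = nn.ModuleList(
            nn.Conv2d(g, g, k, padding=k // 2, bias=False) for k in kernel_sizes)
        self.ecas = nn.ModuleList(ECA() for _ in kernel_sizes)

    def forward(self, x):
        outs = []
        for g, conv, eca in zip(x.chunk(len(self.convs), dim=1),
                                self.convs, self.ecas):
            feat = conv(g)                 # group-specific kernel size
            outs.append(eca(feat) * feat)  # third feature map, ECA-weighted
        return torch.cat(outs, dim=1)      # splice groups on the channel axis
```

The output keeps the input shape, so the block can be dropped in after a residual stage without changing downstream layer dimensions.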
Further, the device further comprises a training module, configured to perform the following training process of the small target detection model: acquiring a small target training sample set; performing cluster analysis on the small target training sample set by adopting a K-means++ clustering algorithm, and determining the size and proportion of an anchor frame suitable for the training set; inputting samples in the small target training sample set into an initial detection model corresponding to the small target detection model; extracting sample characteristics through a characteristic extraction network added with an MS-ECA attention mechanism in an initial detection model; processing the sample characteristics through a regional suggestion network configured with the size and the proportion of the anchor frame to obtain a prediction frame; determining a loss value based on a loss function including an area ratio and an aspect ratio of the prediction frame and the real frame; and adjusting parameters of the initial detection model based on the loss value until the initial detection model converges to obtain a small target detection model.
Further, the training module is used for acquiring a small target data set; performing scaling treatment and picture data enhancement treatment on pictures in the small target data set to obtain a small target training sample set; wherein the scaling process includes: scaling the picture with the real frame width and height value of the small target exceeding the threshold value according to the scaling ratio; the scaling ratio is equal to the minimum value in the target real frame width and height values divided by a preset value; the picture data enhancement processing includes: and (5) rotating, overturning and adjusting HSV values of each picture.
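The scaling rule above can be sketched as follows (the concrete threshold and preset value are placeholders — the text does not specify them — and the exact direction in which the picture is rescaled by the ratio is likewise an assumption):

```python
def scale_ratio(box_w, box_h, threshold=64, preset=32):
    """Ratio for rescaling a picture: if the target's real-frame width and
    height exceed the threshold, the ratio is min(w, h) divided by the
    preset value; otherwise the picture is left unchanged (ratio 1.0).
    `threshold` and `preset` are illustrative values, not from the text."""
    if box_w <= threshold and box_h <= threshold:
        return 1.0
    return min(box_w, box_h) / preset
```

For example, a 128×96 target yields a ratio of 3.0, while a 20×30 target leaves the picture untouched.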
Further, the training module is configured to obtain an area of a real frame corresponding to the small target in each sample in the training set; determining that two real frames belong to one cluster according to the fact that the area difference of the real frames in the two samples is smaller than or equal to a preset area threshold value, otherwise, judging all samples according to the principle that the two real frames do not belong to one cluster, and determining the number of clusters; carrying out cluster center initialization processing on samples in the training set to obtain initialized cluster centers with the same number as the number of clusters; updating the plurality of initialized cluster centers to obtain a plurality of updated cluster centers; and determining the size and proportion of the anchor frame applicable to the training set based on the target real frames respectively corresponding to the plurality of updated clustering centers.
Further, the training module is configured to randomly select one sample from the training set as a current cluster center; a cluster center determining step is carried out based on the current cluster center: calculating IoU values between each sample and the current cluster center, taking the sample corresponding to the minimum IoU value as a new current cluster center, continuing to execute the cluster center determining step until the number of the determined current cluster centers reaches the number of clusters, and determining the determined current cluster centers as initialized cluster centers.
Further, the training module is configured to perform the sample division step based on a plurality of initialized cluster centers: calculating IoU values of the samples and each initialized cluster center for each sample; the sample is attributed to a cluster corresponding to an initialization cluster center with the maximum IoU value; updating the cluster center in each cluster according to the median value in the cluster; the median value is a sample of the plurality of samples in the cluster, which corresponds to a IoU value in the middle position in IoU values of the center of the cluster; the sample partitioning step is continued based on the updated cluster centers until the samples in each cluster are no longer changing, and the cluster center in each cluster at this time is determined as the updated cluster center.
Further, the training module is configured to calculate the loss value AWH-IoU according to the following formula:
wherein IoU represents the area intersection-over-union ratio between the predicted frame and the real frame in the output result; (x1, y1) and (x2, y2) represent the center coordinates of the predicted frame and the real frame respectively; ρ represents the center distance between the predicted frame and the real frame; c represents the diagonal distance of the minimum closed region that can enclose the predicted and real frames; S1 and S2 represent the areas of the predicted and real frames respectively; w1, h1 and w2, h2 represent the width and height values of the real frame and the predicted frame respectively; and exp represents the exponential function.
The device provided by the embodiment of the present application has the same implementation principle and technical effects as those of the foregoing method embodiment, and for the sake of brief description, reference may be made to the corresponding content in the foregoing method embodiment where the device embodiment is not mentioned.
The embodiment of the application also provides a computer readable storage medium, which stores computer executable instructions that, when being called and executed by a processor, cause the processor to implement the above method, and the specific implementation can refer to the foregoing method embodiment and will not be described herein.
The method, the apparatus and the computer program product of the electronic device provided in the embodiments of the present application include a computer readable storage medium storing program codes, where the instructions included in the program codes may be used to execute the method described in the foregoing method embodiment, and specific implementation may refer to the method embodiment and will not be described herein.
The relative steps, numerical expressions and numerical values of the components and steps set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In the description of the present application, it should be noted that the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that: the above examples are only specific embodiments of the present application, and are not intended to limit the scope of the present application, but it should be understood by those skilled in the art that the present application is not limited thereto, and that the present application is described in detail with reference to the foregoing examples: any person skilled in the art may modify or easily conceive of the technical solution described in the foregoing embodiments, or perform equivalent substitution of some of the technical features, while remaining within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of small target detection, the method comprising:
acquiring a picture to be detected containing a small target;
Inputting the picture to be detected into a small target detection model; the small target detection model includes: a feature extraction network added with an MS-ECA attention mechanism, a region suggestion network with preset anchor frame sizes and proportions, and a loss function with newly added area ratio and aspect ratio penalty terms; the preset anchor frame sizes and proportions are obtained by performing cluster analysis on a small target training sample set through a clustering algorithm;
Extracting a feature map containing small target information in the picture to be detected through the feature extraction network;
Outputting a plurality of candidate frames based on the extracted feature map through the regional suggestion network, and filtering the candidate frames to obtain a prediction frame for calculation;
and adjusting the prediction frame through the region-of-interest layer, and obtaining a target probability score and a bounding box regression score after the prediction frame passes through the full-connection layer so as to obtain a small target detection frame and a class corresponding to the picture to be detected.
2. The method according to claim 1, wherein the step of extracting, through the feature extraction network, a feature map containing small object information in the picture to be detected includes:
extracting a first feature map of the picture to be detected through a backbone network in the feature extraction network;
Dividing the first feature map into a plurality of groups of feature maps according to the channel, and respectively convolving the plurality of groups of feature maps by convolution cores with different sizes to obtain a plurality of groups of second feature maps;
Inputting a plurality of groups of second feature images obtained after convolution into an ECA attention mechanism to obtain different weights, and multiplying each weight with the corresponding second feature image to obtain third feature images of different groups;
And splicing the third feature images of different groups on the channel to obtain the feature images containing the small target information.
3. The method of claim 1, wherein the training process of the small target detection model is as follows:
Acquiring a small target training sample set;
performing cluster analysis on the small target training sample set by adopting a K-means++ clustering algorithm, and determining the size and proportion of an anchor frame suitable for the training set;
inputting the samples in the small target training sample set into an initial detection model corresponding to the small target detection model;
extracting sample characteristics through a characteristic extraction network added with an MS-ECA attention mechanism in the initial detection model; processing the sample characteristics through a regional suggestion network configured with the anchor frame size and the proportion to obtain a prediction frame; determining a loss value based on a loss function including an area ratio and an aspect ratio of the prediction frame and the real frame;
and adjusting parameters of the initial detection model based on the loss value until the initial detection model converges to obtain a small target detection model.
4. A method according to claim 3, wherein the step of obtaining a small target training sample set comprises:
Acquiring a small target data set;
performing scaling treatment and picture data enhancement treatment on the pictures in the small target data set to obtain a small target training sample set;
Wherein the scaling process includes: scaling the picture with the real frame width and height value of the small target exceeding the threshold value according to the scaling ratio; the scaling ratio is equal to the minimum value in the target real frame width and height values divided by a preset value; the picture data enhancement processing includes: and (5) rotating, overturning and adjusting HSV values of each picture.
5. A method according to claim 3, wherein the step of performing cluster analysis on the small target training sample set using a K-means++ cluster algorithm to determine the size and scale of the anchor frame applicable to the training set comprises:
acquiring the area of a real frame corresponding to a small target in each sample in a training set;
Determining that two real frames belong to one cluster according to the fact that the area difference of the real frames in the two samples is smaller than or equal to a preset area threshold value, otherwise, judging all samples according to the principle that the two real frames do not belong to one cluster, and determining the number of clusters;
Carrying out cluster center initialization processing on the samples in the training set to obtain initialized cluster centers with the same number as the number of clusters;
Updating the plurality of initialized cluster centers to obtain a plurality of updated cluster centers;
and determining the size and proportion of the anchor frame applicable to the training set based on the target real frames respectively corresponding to the plurality of updated clustering centers.
6. The method of claim 5, wherein the step of initializing cluster centers for samples in the training set to obtain the same number of initialized cluster centers as the number of clusters comprises:
Randomly selecting a sample from the training set as a current cluster center; and carrying out a cluster center determining step based on the current cluster center:
Calculating IoU values between each other sample and the current cluster center, taking the sample corresponding to the minimum IoU value as a new current cluster center, continuing to execute the cluster center determining step until the number of the determined current cluster centers reaches the number of clusters, and determining the determined current cluster center as an initialized cluster center.
7. The method of claim 5, wherein the step of updating the plurality of initialized cluster centers to obtain a plurality of updated cluster centers comprises:
Performing a sample partitioning step based on the plurality of initialized cluster centers:
Calculating IoU values of the samples and each initialized cluster center for each sample;
Attributing the sample to a cluster corresponding to an initialization cluster center with the maximum IoU value;
Updating the cluster center in each cluster according to the median value in the cluster; the median value is a sample of a plurality of samples in the cluster, which corresponds to a IoU value in the middle position in IoU values at the center of the cluster;
The sample partitioning step is continued based on the updated cluster centers until the samples in each cluster are no longer changing, and the cluster center in each cluster at this time is determined as the updated cluster center.
8. A method according to claim 3, wherein the step of determining a loss value based on a loss function comprising the aspect ratio and the area ratio of the prediction box and the real box comprises:
The loss value AWH-IoU is calculated according to the following equation:
wherein IoU represents the area intersection-over-union ratio between the predicted frame and the real frame in the output result; (x1, y1) and (x2, y2) represent the center coordinates of the predicted frame and the real frame respectively; ρ represents the center distance between the predicted frame and the real frame; c represents the diagonal distance of the minimum closed region that can enclose the predicted and real frames; S1 and S2 represent the areas of the predicted and real frames respectively; w1, h1 and w2, h2 represent the width and height values of the real frame and the predicted frame respectively; and exp represents the exponential function.
9. A small object detection device, characterized in that the device comprises a plurality of modules for performing the steps of the small object detection method of any one of claims 1 to 8, the plurality of modules comprising a picture acquisition module, a picture input module, a feature extraction module and a prediction module, wherein:
The image acquisition module is used for acquiring an image to be detected containing a small target;
the picture input module is used for inputting the picture to be detected into a small target detection model; the small target detection model comprises a feature extraction network added with an MS-ECA attention mechanism, a region suggestion network with preset anchor frame sizes and proportions, and a loss function with newly added area ratio and aspect ratio penalty terms; the preset anchor frame sizes and proportions are obtained by performing cluster analysis on a small target training sample set through a clustering algorithm;
The feature extraction module is used for extracting a feature map containing small target information in the picture to be detected through the feature extraction network;
the prediction module is used for outputting a plurality of candidate frames based on the extracted feature images through the regional suggestion network, and filtering the candidate frames to obtain a prediction frame for calculation; and adjusting the prediction frame through the region-of-interest layer, and obtaining a target probability score and a bounding box regression score after the prediction frame passes through the full-connection layer so as to obtain a small target detection frame and a class corresponding to the picture to be detected.
10. A computer-readable storage medium storing computer-executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of any one of claims 1 to 8.
CN202410424489.9A 2024-04-10 2024-04-10 Small target detection method, device and storage medium Pending CN118015388A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410424489.9A CN118015388A (en) 2024-04-10 2024-04-10 Small target detection method, device and storage medium

Publications (1)

Publication Number Publication Date
CN118015388A true CN118015388A (en) 2024-05-10

Family

ID=90944885

Country Status (1)

Country Link
CN (1) CN118015388A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019196130A1 (en) * 2018-04-12 2019-10-17 广州飒特红外股份有限公司 Classifier training method and device for vehicle-mounted thermal imaging pedestrian detection
CN113344113A (en) * 2021-06-27 2021-09-03 东南大学 Yolov3 anchor frame determination method based on improved k-means clustering
CN115082698A (en) * 2022-06-28 2022-09-20 华南理工大学 Distracted driving behavior detection method based on multi-scale attention module
CN115661072A (en) * 2022-10-25 2023-01-31 山东省计算中心(国家超级计算济南中心) Disc rake surface defect detection method based on improved fast RCNN algorithm
CN116721350A (en) * 2023-06-30 2023-09-08 郑州大学 Boundary box regression loss calculation method for target detection
CN116894825A (en) * 2023-07-18 2023-10-17 西南石油大学 X-ray image weld defect detection method based on deep learning

Non-Patent Citations (3)

Title
CHENGZHUO YE et al.: "Multi-scale small object detection based on improved Faster R-CNN", Second International Conference on Green Communication, Network, and Internet of Things, 8 March 2023 (2023-03-08), pages 1-8, XP060174288, DOI: 10.1117/12.2667204 *
路人贾'W': "Loss Functions: IoU, GIoU, DIoU, CIoU, EIoU, alpha IoU, SIoU, WIoU Explained in Detail with a PyTorch Implementation", pages 1-12, Retrieved from the Internet <URL:https://blog.csdn.net/weixin_43334693/article/details/131304963> *
HUANG Fengqi et al.: "Improved YOLO Object Detection Algorithm Based on Deformable Convolution", Computer Engineering, vol. 47, no. 10, 31 October 2021 (2021-10-31), pages 269-275 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination