CN116188929A - Small target detection method and small target detection system

Info

Publication number
CN116188929A
Authority
CN
China
Prior art keywords: small target, target detection, fusion, conv, feature
Prior art date
Legal status: Pending (assumed; not a legal conclusion)
Application number
CN202310115681.5A
Other languages
Chinese (zh)
Inventor
于瑞云 (Yu Ruiyun)
赵前程 (Zhao Qiancheng)
Current Assignee
Northeastern University China
Original Assignee
Northeastern University China
Application filed by Northeastern University China
Priority to CN202310115681.5A
Publication of CN116188929A

Classifications

    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06N 3/084: Learning methods - backpropagation, e.g. using gradient descent
    • G06V 10/40: Extraction of image or video features
    • G06V 10/765: Image or video recognition using classification, e.g. of video objects, using rules for classification or partitioning the feature space
    • G06V 10/766: Image or video recognition using regression, e.g. by projecting features on hyperplanes
    • G06V 10/82: Image or video recognition using neural networks
    • G06V 20/46: Extracting features or characteristics from video content, e.g. video fingerprints, representative shots or key frames
    • G06V 2201/07: Target detection (indexing scheme)
    • Y02T 10/40: Engine management systems

Abstract

The invention belongs to the technical field of computers and particularly relates to a small target detection method and a small target detection system. The small target detection method comprises the following steps: using an attention mechanism to adjust the fusion proportion of the upper-layer feature map within each pair of adjacent feature maps to obtain a fused feature map; and performing small target detection on the fused feature map to obtain a small target detection result. Because the detection method uses attention during feature fusion to adjust the fusion proportion of the upper-layer feature map, the semantic features located at small-target positions are screened out, and the detection accuracy for small targets is greatly improved.

Description

Small target detection method and small target detection system
Technical Field
The application belongs to the technical field of computers, and particularly relates to a small target detection method and a small target detection system.
Background
In recent years, object detection, as an important branch of computer vision, has played a major role in many fields. Because small targets are abundant in real scenes, small target detection has broad application prospects in fields such as medical treatment, intelligent transportation, intelligent retail, security and criminal investigation, and national defense. Detection accuracy plays a crucial role in downstream event triggering, yet the detection accuracy for small targets remains unsatisfactory and has become a major difficulty in the field.
Small targets occupy few pixels, cover a small area, and carry little information; these basic characteristics are the root cause of the difficulty of detecting small targets in images. In existing small target detection methods, single-stage detectors adopt multi-scale detection: the upper-layer feature map provided by the backbone network is first passed through a 1×1 convolution to unify the number of channels, then bilinearly interpolated so that its H×W matches the lower-layer feature map, and finally the interpolated feature map is simply added to the lower-layer feature map. Because the deep semantic features contain the semantic information of medium/large targets as well as that of small targets, simple additive fusion introduces semantic information that is useless for small target detection into the feature level responsible for small targets, adding interference noise and lowering small target detection accuracy.
Therefore, how to effectively detect the small target and improve the detection accuracy of the small target becomes a technical problem to be solved.
Disclosure of Invention
First, the technical problem to be solved
In view of the foregoing drawbacks and disadvantages of the prior art, the present application provides a small target detection method and a small target detection system.
(II) technical scheme
In order to achieve the above purpose, the present application adopts the following technical scheme:
in a first aspect, an embodiment of the present application provides a small target detection method based on an attention adaptive fusion feature, where the method includes:
adjusting the fusion proportion of the upper layer feature images in the two adjacent layers of feature images by using an attention mechanism to obtain a fusion feature image;
and carrying out small target detection based on the fusion feature map to obtain a small target detection result.
Optionally, the method comprises the steps of:
s1, acquiring an image to be detected;
s2, inputting the image to be detected into a pre-trained small target detection model to obtain a corresponding small target detection result; the small target detection model comprises a main network module for extracting a multi-scale feature map, a feature fusion module for carrying out feature fusion on the multi-scale feature map by using an attention mechanism to adjust the fusion proportion of upper-layer feature maps in two adjacent layers of feature maps, and a detection head module for carrying out small target detection on the fused feature maps.
Optionally, the step of extracting the multi-scale feature map by the backbone network module includes:
and firstly, carrying out 7×7 convolution with a step length of 2 and a 2×2 maximum pooling layer with a step length of 2 on the image to be detected, and respectively carrying out repeated stacking residual blocks with different numbers to obtain a C2 characteristic diagram, a C3 characteristic diagram, a C4 characteristic diagram and a C5 characteristic diagram with the sizes of 1/4, 1/8, 1/16 and 1/32 of the original figures, wherein the residual blocks consist of 1×1 convolution and 3×3 convolution.
Optionally, the method for performing feature fusion by the feature fusion module on the multi-scale feature map by adjusting the fusion proportion of the upper layer feature map in the two adjacent layers of feature maps through an attention mechanism includes:
the feature maps of adjacent layers are aggregated after the fusion proportion of the upper layer feature maps is adjusted by using an attention mechanism according to the following formula:
P_i = conv_3×3(f_att(f_upsample(T_{i+1})) + conv_1×1(C_i))
where P_i is the fused feature map after attention is added, conv_3×3 denotes convolution with a 3×3 kernel, conv_1×1 denotes convolution with a 1×1 kernel used for channel-number matching, f_upsample denotes upsampling, C_i denotes the feature map of the current layer, T_{i+1} denotes the feature map of the upper layer, and f_att denotes adding attention to the input feature map.
Optionally, attention is added to the input feature map according to the following formula:
f_att(x_in) = x_in * sigmoid(conv_1×1(conv_1×1(x_in)))
where x_in denotes the input feature map, conv_1×1 denotes the convolution operation with a 1×1 kernel, and sigmoid denotes the activation function.
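As an illustrative sketch only (not part of the original disclosure), the attention of the formula above could be realized in PyTorch as follows; the 64-channel bottleneck and the normalization step follow the detailed embodiment described later, while the BatchNorm choice and the class name AttentionGate are assumptions:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Position attention: x * sigmoid(conv1x1(conv1x1(x)))."""
    def __init__(self, in_channels: int, mid_channels: int = 64):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(mid_channels)   # normalization step; exact norm type assumed
        self.project = nn.Conv2d(mid_channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # N x 1 x H x W position probability map in (0, 1)
        weight = torch.sigmoid(self.project(self.norm(self.reduce(x))))
        return x * weight                          # broadcast over channels
```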
Optionally, the method for detecting the fusion feature map by the detection head module includes:
and carrying out classification detection on the fusion feature map according to the following formula to obtain the class probability of each prediction boundary box at each spatial position, and expanding the square receptive field into the square receptive field and the rectangular receptive field through the receptive field self-adaptive selection module in the detection process:
f output1 =conv 3×3 (Rconv 3×3 (RFASM(RFASM(Rconv 3×3 (x)))))
regression detection is carried out on the fusion feature map according to the following formula, and the offset of each reference anchor frame at each spatial position is obtained to determine the position of the prediction boundary frame:
f output2 =conv 3×3 (Rconv 3×3 (Rconv 3×3 (Rconv 3×3 (Rconv 3×3 (x)))))
wherein f output1 Representing class detectionPrediction result, f output2 Representing the prediction result of regression detection, x represents the input feature map, conv 3×3 Indicating that the 3 x 3 convolution kernel is convolved with Rconv 3×3 Representing a 3 x 3 convolution followed by a ReLU activation function, RFASM represents a receptive field adaptive selection module.
Optionally, the receptive field adaptive selection module expands the square receptive field to a square receptive field and a rectangular receptive field according to the following formula:
f_next = conv_1×1(m) + m_1
where f_next denotes the output of the receptive field adaptive selection module, and m is obtained by normalizing m_2, m_3, m_4 and m_5 at each position:
m = Σ_{k=2..5} [ w_k / (w_2 + w_3 + w_4 + w_5 + ε) ] · m_k
where m_1, m_2, m_3, m_4 and m_5 are calculated according to the following formulas:
m_1 = conv_3×3(x)
m_2 = conv_1×3(conv_1×1(x))
m_3 = conv_3×1(conv_1×1(x))
m_4 = conv_1×3(conv_1×3(conv_1×1(x)))
m_5 = conv_3×1(conv_3×1(conv_1×1(x)))
w_2, w_3, w_4 and w_5 are obtained by splitting w along the channel dimension, and w is calculated according to the following formula:
w = conv_1×1(m_2 + m_3 + m_4 + m_5)
where x denotes the input feature map, conv_1×3 denotes convolution with a 1×3 kernel, conv_3×1 denotes convolution with a 3×1 kernel, and ε takes the value 0.0001.
Optionally, after S1, before S2, the method further includes:
preprocessing the image to be detected to obtain a preprocessed image to be detected, wherein the preprocessing comprises one or more of filling, changing the size of the image and enhancing data.
Optionally, S2 further includes:
and S3, deleting the overlapped detection frames through post-processing to obtain a final small target detection result.
In a second aspect, embodiments of the present application provide a small target detection system, where the system includes a video acquisition subsystem and a small target detection subsystem;
the video acquisition subsystem is connected with the small target detection subsystem and is used for acquiring video images of a target area through video acquisition equipment and transmitting the video images of the target area to the small target detection subsystem;
the small target detection subsystem is used for receiving the video image and carrying out real-time small target detection on the video image by adopting the small target detection method based on the attention self-adaptive fusion characteristic according to any one of the first aspect.
In a third aspect, embodiments of the present application provide a computer readable storage medium storing a small object detection program based on an attention adaptive fusion feature, where the small object detection program when executed by a processor causes the processor to perform the steps of the small object detection method based on an attention adaptive fusion feature as set forth in any one of the first aspects above.
In a fourth aspect, embodiments of the present application provide a computer device, including a memory and a processor, where the memory stores a small object detection program based on an attention adaptive fusion feature, where the small object detection program when executed by the processor causes the processor to perform the steps of the small object detection method based on an attention adaptive fusion feature as set forth in any one of the first aspects above.
(III) beneficial effects
The beneficial effects of this application are: the application provides a small target detection method and a small target detection system, wherein the small target detection method comprises the following steps: adjusting the fusion proportion of the upper layer feature images in the two adjacent layers of feature images by using an attention mechanism to obtain a fusion feature image; and carrying out small target detection based on the fusion feature map to obtain a small target detection result.
According to the small target detection method, attention is used during the fusion of upper- and lower-layer features to screen out the semantic features located at small-target positions, so that the feature map of the level responsible for small target detection fuses only small-target semantics, which greatly improves small target detection accuracy; the method is also easy to deploy and works in a plug-and-play manner.
Further, during feature detection, the receptive field adaptive selection module RFASM uses attention weighting to expand rectangular receptive fields, allowing the network to adaptively select the appropriate receptive field for capturing the object at each position. This strengthens the ability to capture the extremely tall or extremely wide small targets found in datasets and real-life scenes, further improving small target detection accuracy.
Furthermore, because attention is used, the weights of the attention module adapt during training to different datasets, which broadens the range of applications.
Drawings
The application is described with the aid of the following figures:
FIG. 1 is a flow chart of a small object detection method based on an attention adaptive fusion feature according to an embodiment of the present application;
FIG. 2 is a flow chart of a small object detection method based on an attention adaptive fusion feature according to another embodiment of the present application;
FIG. 3 is a schematic diagram of a training process of a small target detection model according to another embodiment of the present application;
FIG. 4 is a schematic diagram of a small object detection model in another embodiment of the present application;
FIG. 5 is a feature fusion flow chart in another embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a process flow of the attention module to the input feature map according to another embodiment of the present application;
FIG. 7 is a block diagram of a receptive field adaptive selection module in accordance with another embodiment of the application;
FIG. 8 is a diagram of the prior art RetinaNet model after first-tier feature fusion of the FPN;
FIG. 9 is a diagram of a small object detection model after feature fusion of the first layer of FPN in another embodiment of the present application;
FIG. 10 is a schematic diagram of a small object detection system in one embodiment of the present application;
FIG. 11 is a schematic architecture diagram of a computer device in one embodiment of the present application.
Detailed Description
The invention will be better explained by the following detailed description of the embodiments with reference to the drawings. It is to be understood that the specific embodiments described below are merely illustrative of the related invention, and not restrictive of the invention. In addition, it should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other; for convenience of description, only parts related to the invention are shown in the drawings.
Although different scenes define small objects differently and no unified standard has yet formed, existing definitions of small objects fall mainly into two types: definitions based on relative scale and definitions based on absolute scale. Under the relative-scale definition, objects whose median ratio of bounding-box area to image area lies between 0.08% and 0.58% are considered small targets. Under the absolute-scale definition, a target with a resolution of less than 32×32 pixels is regarded as a small target. In the small target detection method of the present application, "small target" refers to the absolute-scale definition.
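For concreteness, a minimal helper reflecting the absolute-scale definition used here might look as follows (the function name and the way the threshold is applied are illustrative only):

```python
def is_small_target(box_w: float, box_h: float) -> bool:
    """Absolute-scale definition: a target occupying fewer than 32x32 pixels is small."""
    return box_w * box_h < 32 * 32
```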
Example 1
Fig. 1 is a flow chart of a small target detection method based on an attention adaptive fusion feature in an embodiment of the present application, as shown in fig. 1, the small target detection method based on the attention adaptive fusion feature in the embodiment includes the following steps:
s1, adjusting the fusion proportion of upper-layer feature images in two adjacent layers of feature images by using an attention mechanism to obtain a fusion feature image;
s2, small target detection is carried out based on the fusion feature map, and a small target detection result is obtained.
According to the small target detection method, the semantic features of the position of the small target are screened out through attention in the feature fusion process, so that the feature map aiming at the small target detection layer only fuses the semantic features of the small target, and the small target detection precision is greatly improved.
The small target detection method of the embodiment can be applied to computer equipment. The small target detection method of the embodiment can be executed by the computer device through a software system. The type of computer device may be a notebook computer, a server, etc. The specific type of computer device is not particularly limited in this application.
It can be appreciated that the small object detection method of this embodiment may be performed solely by a client device or a server device, or cooperatively by the client device and the server device. The server may be a single server or a cloud built from a server cluster.
For example, the small object detection method may be integrated with the client. After receiving the small target detection request, the client device can execute small target detection through its own hardware environment.
For another example, the small object detection method may be integrated with the server device. After receiving the small target detection request, the server device can execute the small target detection method through its own hardware environment.
In order to better understand the present invention, the implementation process of the method of the present embodiment will be described below with the server as the execution body.
In this embodiment, the small target detection is performed by a small target detection model trained in advance. The small target detection method comprises the following implementation processes:
acquiring an image to be detected;
inputting an image to be detected into a pre-trained small target detection model to obtain a corresponding small target detection result; the small target detection model comprises a main network module for extracting a multi-scale feature map, a feature fusion module for carrying out feature fusion on the multi-scale feature map based on attention, and a detection head module for carrying out small target detection on the fused feature map.
In this embodiment, the acquired image to be detected may be an image obtained by cutting frames from a video image, and then small target detection is performed on each frame of the video. The video image may be a video image acquired in real time by a video acquisition device located in the environment of the small target object, or may be a pre-recorded video image read from a video storage system.
For example, the video capturing device and the server executing the method of the present embodiment may establish a communication connection through a wireless network or a wired network, and the server receives the video image sent by the video capturing device through the established communication connection.
When the video image is a video stream acquired in real time, acquiring the image to be detected may include: and acquiring a key frame image. Specifically, the key frame image may be extracted by the following steps:
decoding the acquired real-time video stream to obtain multi-frame original frame images corresponding to the real-time video stream;
extracting a key frame image from an original frame image based on a preset frame extraction rule;
and taking the extracted key frame image as an image to be detected.
Illustratively, the frame rate of video is typically 30 frames/second and can be reduced to a minimum of 25 frames/second, so the real-time video stream can be decoded according to its frame rate. Specifically, the monitoring video can be decoded while it is received in real time, thereby obtaining the multi-frame original images corresponding to the real-time video stream. After the original frames are acquired, given limited processing resources and the requirements of the target service, not every original frame needs to be processed; frames can therefore be extracted as key frames according to a frame-extraction rule, such as taking one frame every 10 frames or one frame every 5 frames.
It should be noted that, the interval of the frame extraction may be set according to practical situations, which is not limited in this application.
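A minimal sketch of the key-frame extraction described above, assuming OpenCV is used for decoding; the stream source and the 10-frame interval are placeholders:

```python
import cv2

def extract_key_frames(source: str, interval: int = 10):
    """Decode a video stream and yield one key frame every `interval` frames."""
    cap = cv2.VideoCapture(source)   # e.g. an RTSP URL or a video file path (placeholder)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % interval == 0:
            yield frame              # key frame used as an image to be detected
        index += 1
    cap.release()
```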
In this embodiment, the backbone network module in the small target detection model may employ ResNet50 or ResNet101.
It should be noted that, other neural networks may be adopted as the backbone network, and the neural network structure of the backbone network module is not specifically limited in this embodiment.
The step of extracting the multi-scale feature map by the backbone network module comprises the following steps:
and firstly, carrying out 7×7 convolution with a step length of 2, carrying out 2×2 maximum pooling layer with a step length of 2, and respectively carrying out repeated stacking residual blocks with different numbers to obtain a C2 characteristic diagram, a C3 characteristic diagram, a C4 characteristic diagram and a C5 characteristic diagram with the sizes of 1/4, 1/8, 1/16 and 1/32 of the original figures, wherein the residual blocks consist of 1×1 convolution and 3×3 convolution.
In this embodiment, the method for feature fusion by the feature fusion module to adjust the fusion proportion of the upper layer feature map in the two adjacent layers of feature maps by using the attention mechanism includes:
and (3) adjusting the fusion proportion of the upper layer feature map by using an attention mechanism according to a formula (1), and then aggregating the feature maps of adjacent layers:
P_i = conv_3×3(f_att(f_upsample(T_{i+1})) + conv_1×1(C_i)) (1)
where P_i is the fused feature map after attention is added, conv_3×3 denotes convolution with a 3×3 kernel, conv_1×1 denotes convolution with a 1×1 kernel used for channel-number matching, f_upsample denotes upsampling, C_i denotes the feature map of the current layer, T_{i+1} denotes the feature map of the upper layer, and f_att denotes adding attention to the input feature map.
Specifically, attention is added to the inputted feature map according to formula (2):
f_att(x_in) = x_in * sigmoid(conv_1×1(conv_1×1(x_in))) (2)
where x_in denotes the input feature map, conv_1×1 denotes the convolution operation with a 1×1 kernel, and sigmoid denotes the activation function.
Adding attention to the input feature map according to formula (2) prevents irrelevant semantic information from being introduced into the feature map of the level dedicated to small target detection, improving the accuracy of small target detection.
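A minimal sketch of one fusion step of formula (1) is given below, under the assumptions that upsampling is bilinear and that 256 FPN channels are used; the inline attention gate mirrors formula (2), and the second attention applied after the 3×3 smoothing convolution in the detailed embodiment is omitted here for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseLevel(nn.Module):
    """One step of formula (1): P_i = conv3x3(f_att(upsample(T_{i+1})) + conv1x1(C_i))."""
    def __init__(self, c_channels: int, fpn_channels: int = 256):
        super().__init__()
        self.lateral = nn.Conv2d(c_channels, fpn_channels, 1)      # channel matching of C_i
        self.att = nn.Sequential(                                   # attention gate, formula (2)
            nn.Conv2d(fpn_channels, 64, 1), nn.BatchNorm2d(64),
            nn.Conv2d(64, 1, 1), nn.Sigmoid())
        self.smooth = nn.Conv2d(fpn_channels, fpn_channels, 3, padding=1)

    def forward(self, c_i: torch.Tensor, t_up: torch.Tensor) -> torch.Tensor:
        t_up = F.interpolate(t_up, size=c_i.shape[-2:], mode="bilinear", align_corners=False)
        t_up = t_up * self.att(t_up)                                # re-weight upper-layer semantics
        return self.smooth(t_up + self.lateral(c_i))
```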
Example two
Building on the first embodiment, the present embodiment provides a feature detection head module that expands rectangular receptive fields through a receptive field adaptive selection module.
In this embodiment, the method for detecting the fused feature map by the feature detection head module may include:
the classification detection head performs classification detection on the fusion feature map according to the formula (3) to obtain the class probability of each prediction boundary box at each spatial position, and the square receptive field is expanded to the square receptive field and the rectangular receptive field through the receptive field self-adaptive selection module in the detection process:
f_output1 = conv_3×3(Rconv_3×3(RFASM(RFASM(Rconv_3×3(x))))) (3)
The regression detection head performs regression detection on the fused feature map according to formula (4) to obtain the offset of each reference anchor box at each spatial position, which determines the position of the prediction bounding box:
f_output2 = conv_3×3(Rconv_3×3(Rconv_3×3(Rconv_3×3(Rconv_3×3(x))))) (4)
where f_output1 denotes the prediction result of classification detection, f_output2 denotes the prediction result of regression detection, x denotes the input feature map, conv_3×3 denotes convolution with a 3×3 kernel, Rconv_3×3 denotes a 3×3 convolution followed by a ReLU activation function, and RFASM denotes the receptive field adaptive selection module.
Specifically, the detection head module comprises a receptive field adaptive selection module RFASM, wherein the RFASM expands the square receptive field into the square receptive field and the rectangular receptive field according to the formula (5):
f_next = conv_1×1(m) + m_1 (5)
where f_next denotes the output of the receptive field adaptive selection module, and m is obtained by normalizing m_2, m_3, m_4 and m_5 at each position:
m = Σ_{k=2..5} [ w_k / (w_2 + w_3 + w_4 + w_5 + ε) ] · m_k (6)
where m_1, m_2, m_3, m_4 and m_5 are calculated according to formulas (7)-(11):
m_1 = conv_3×3(x) (7)
m_2 = conv_1×3(conv_1×1(x)) (8)
m_3 = conv_3×1(conv_1×1(x)) (9)
m_4 = conv_1×3(conv_1×3(conv_1×1(x))) (10)
m_5 = conv_3×1(conv_3×1(conv_1×1(x))) (11)
w_2, w_3, w_4 and w_5 are obtained by splitting w along the channel dimension: since w is an h×w×4 weight map with 4 channels, splitting by channel yields four h×w×1 weights, namely w_2, w_3, w_4 and w_5; w is calculated according to formula (12):
w = conv_1×1(m_2 + m_3 + m_4 + m_5) (12)
where x denotes the input feature map, conv_1×3 denotes convolution with a 1×3 kernel, conv_3×1 denotes convolution with a 3×1 kernel, and ε takes the value 0.0001 to prevent the denominator from being 0.
Formula (5) expands rectangular receptive fields over the input feature map using 1×3 and 3×1 convolutions, and replaces 1×5 and 5×1 convolutions by stacking two 1×3 or two 3×1 convolutions. All of the 1×3 and 3×1 convolutions are depthwise separable convolutions, which not only yields larger receptive fields that cover extremely tall or extremely narrow small targets in a scene, but also saves computation and storage resources. In addition, because the receptive field adaptive selection module RFASM uses attention weighting, the network can adaptively select the appropriate receptive field for capturing the object at the current position, further improving the detection accuracy of the network model for small targets.
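An illustrative PyTorch sketch of such a module is given below; the 256 input channels, the 64-channel reduction and the function/class names are assumptions drawn from the description above, not a verbatim reproduction of the original implementation:

```python
import torch
import torch.nn as nn

def dw_sep(channels: int, kernel):
    """Depthwise separable convolution with an asymmetric (1x3 or 3x1) kernel."""
    pad = (kernel[0] // 2, kernel[1] // 2)
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel, padding=pad, groups=channels),
        nn.Conv2d(channels, channels, 1))

class RFASM(nn.Module):
    """Receptive field adaptive selection module, per formulas (5)-(12)."""
    def __init__(self, channels: int = 256, mid: int = 64, eps: float = 1e-4):
        super().__init__()
        self.eps = eps
        self.branch1 = nn.Conv2d(channels, channels, 3, padding=1)        # m1: square receptive field
        self.reduce = nn.Conv2d(channels, mid, 1)                          # 1x1 reduction to 64 channels
        self.b2 = dw_sep(mid, (1, 3))                                      # m2: 1x3 branch
        self.b3 = dw_sep(mid, (3, 1))                                      # m3: 3x1 branch
        self.b4 = nn.Sequential(dw_sep(mid, (1, 3)), dw_sep(mid, (1, 3)))  # m4: stacked 1x3 (~1x5)
        self.b5 = nn.Sequential(dw_sep(mid, (3, 1)), dw_sep(mid, (3, 1)))  # m5: stacked 3x1 (~5x1)
        self.weight = nn.Conv2d(mid, 4, 1)                                 # w, split into w2..w5
        self.expand = nn.Conv2d(mid, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m1 = self.branch1(x)
        r = self.reduce(x)
        ms = [self.b2(r), self.b3(r), self.b4(r), self.b5(r)]              # m2..m5
        w = self.weight(sum(ms))                                           # formula (12)
        w = w / (w.sum(dim=1, keepdim=True) + self.eps)                    # per-position normalisation, formula (6)
        m = sum(w[:, k:k + 1] * ms[k] for k in range(4))                   # attention-weighted fusion
        return self.expand(m) + m1                                         # formula (5)
```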
According to the small target detection method in the embodiment, the feature fusion process is regarded as an encoder, the feature detection is regarded as a decoder, more small target information is added into the feature map by refining the encoding capability of the encoder, and meanwhile, the decoding capability of the decoder is enhanced, so that the detection capability of an algorithm model on the small target is improved. Specifically, semantic features of the position of the small target are screened out through attention in the feature fusion process, so that feature images aiming at detecting small target layers only fuse the semantic features of the small target, meanwhile, in the feature detection process, a rectangular receptive field is expanded through a receptive field self-adaptive selection module to capture a data set and a plurality of extremely high or extremely wide small targets in life scenes, and further, the detection precision of the small targets is greatly improved, and the detection performance of a model is improved.
Example III
Fig. 2 is a flow chart of a small target detection method based on an attention adaptive fusion feature according to another embodiment of the present application, as shown in fig. 2, the method includes the following steps:
and step S10, establishing and training a small target detection model.
Fig. 3 is a schematic diagram of a training process of a small target detection model according to another embodiment of the present application, as shown in fig. 3, the training process of the small target detection model according to this embodiment includes:
step S11, a data set is established. And collecting the pictures and labels corresponding to the manually marked pictures, wherein the pictures and the labels are in one-to-one correspondence. The COCO data set is selected in the embodiment, and is the most huge and authoritative data set in the current target detection field, and can evaluate the average precision, the small target precision, the medium target precision and the large target precision of the model.
And step S12, dividing the data set into a training set, a verification set and a test set.
If the data set is not large, it can be partitioned as 6:2:2; if the data set is large (e.g. hundreds of thousands of pictures), it may be partitioned as 8:1:1.
And step S13, loading data.
Data loading is carried out through the DataLoader class provided by the deep learning framework PyTorch; during loading, the number of pictures loaded each time, whether random sampling is used, the number of worker CPU processes and so on can be specified.
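A possible loading setup, assuming the COCO dataset and torchvision's CocoDetection wrapper; the paths, batch size and worker count below are placeholders:

```python
from torch.utils.data import DataLoader
from torchvision.datasets import CocoDetection

# Placeholder paths for the locally stored COCO images and annotation file.
train_set = CocoDetection(root="coco/train2017",
                          annFile="coco/annotations/instances_train2017.json")
train_loader = DataLoader(train_set,
                          batch_size=8,      # number of pictures loaded each time
                          shuffle=True,      # random sampling
                          num_workers=4,     # number of worker processes
                          collate_fn=lambda batch: tuple(zip(*batch)))  # keep variable-size targets
```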
And S14, preprocessing data.
In computer vision tasks, neural networks require input image data with uniform height and width, so the image must be pre-processed, including padding and resizing. For the object detection task, because the results must eventually be annotated on the original image, the predictions also need to be remapped back to the original image. In addition, data preprocessing includes operations such as data augmentation.
And S15, model forward reasoning. And sending the loaded picture data into the initialized model, and calculating the weights of the data and the model layer by layer to finally obtain a forward reasoning result of the model.
And S16, obtaining a model forward reasoning result, and calculating a loss value.
And calculating a loss value based on the model forward reasoning result and the picture corresponding label by using the defined loss function.
And S17, calculating the gradient of each layer of the network according to the loss value, and reversely propagating to update the network weight.
And calculating the gradient of each layer of the weight in the network according to the loss value, subtracting the learning rate multiplied by the gradient value from the current weight value, and storing the model.
Step S18, judging whether a preset termination condition is met:
if the model has not reached the prescribed round or the loss value is still greater than the threshold value, returning to the step S13;
if the model reaches the specified round or the loss value is smaller than the threshold value, the model weight file is saved, training is terminated, and the process is ended.
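Steps S15-S18 can be condensed into a conventional training loop such as the sketch below; the SGD optimizer, learning rate, epoch count and checkpoint names are illustrative assumptions, not values specified in this application:

```python
import torch

def train(model, train_loader, loss_fn, epochs=12, lr=0.01, loss_threshold=0.05):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4)
    for epoch in range(epochs):
        for images, targets in train_loader:
            outputs = model(images)              # S15: forward inference
            loss = loss_fn(outputs, targets)     # S16: loss from predictions and labels
            optimizer.zero_grad()
            loss.backward()                      # S17: back-propagate gradients through every layer
            optimizer.step()                     # weight := weight - lr * gradient
        torch.save(model.state_dict(), f"checkpoint_epoch{epoch}.pth")
        if loss.item() < loss_threshold:         # S18: termination condition
            break
```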
In the training process, training an initial small target detection model based on data of a training set, determining model network parameters, and adjusting super parameters of the model by using a verification set to obtain a trained small target detection model; the accuracy of the trained model is then tested on the test set.
Specific testing methods are prior art and will not be described further herein.
And step S20, acquiring an image to be detected, and preprocessing the image to be detected.
The image to be detected is adjusted to a preset size by resizing or padding. Typically this is a size the network model accepts, e.g. 1333×800 pixels.
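A sketch of this preprocessing with OpenCV, assuming the usual short-side-800 / long-side-1333 convention and padding up to a multiple of the 1/32 feature stride; the function name and the padding strategy are assumptions:

```python
import cv2
import numpy as np

def resize_and_pad(image: np.ndarray, short_side: int = 800, long_side: int = 1333):
    """Resize keeping aspect ratio (short side ~800 px, long side capped at 1333 px), then pad."""
    h, w = image.shape[:2]
    scale = min(short_side / min(h, w), long_side / max(h, w))
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(image, (new_w, new_h))
    pad_h, pad_w = -new_h % 32, -new_w % 32      # fill so H and W divide the 1/32 feature stride
    padded = cv2.copyMakeBorder(resized, 0, pad_h, 0, pad_w, cv2.BORDER_CONSTANT, value=0)
    return padded, scale                         # keep the scale to remap boxes to the original image
```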
Step S30, the preprocessed image is input into a small target detection model.
And S40, the small target detection model obtains an initial result of small target detection through forward reasoning. And sending the picture data into a small target detection model network for calculation, and obtaining a model forward reasoning result, namely an initial small target detection result.
And S50, deleting the overlapped detection frames through post-processing to obtain a final small target detection result.
Because the model produces very many forward inference results, presenting every prediction would draw many duplicate, mutually overlapping boxes, and such a large number of prediction boxes is meaningless. After detection is complete, a post-processing step is therefore needed to filter out these duplicate, meaningless prediction boxes. Post-processing may use non-maximum suppression (NMS), Soft-NMS, and similar methods.
And step S60, remapping the small target detection result and presenting the small target detection result in an original picture by a picture frame.
And remapping the small target detection result after the post-treatment back to the original image, marking the target position and the target category in the original image, and finally storing or displaying the detection result.
The embodiment provides a self-adaptive fusion method based on attention in the feature fusion process, which screens semantic features of the position of a small target by using an attention module through a feature map after linear interpolation, achieves the purpose of eliminating irrelevant middle/large target semantic features, enables feature maps aiming at a small target layer to be detected to only fuse the small target semantic features, and greatly improves the precision of the model on small target detection.
Because an attention module is used, the model completes adaptive fusion during feature fusion: it automatically selects small-target semantic features to fuse according to the current data set pictures, rather than relying on manually designed hyperparameters. In addition, a receptive field adaptive selection module that expands rectangular receptive fields is added to the detection head, so the network can adaptively capture small targets of different shapes, greatly improving its ability to capture targets. The method of this embodiment is therefore simple to deploy, plug-and-play, and further improves small target detection accuracy.
To verify the technical effect of the method of the present embodiment, target detection was performed using pictures of the test set in the COCO dataset.
TABLE 1
[Table 1 content not reproduced; originally provided as an image.]
Table 1 is a comparative table of the results of several existing base models and target detection performed on the COCO dataset after addition of the module of the present invention. The data in the table is the target detection accuracy.
It can be seen from the table that, in terms of small target accuracy, the models equipped with the modules of the present invention outperform the corresponding original models. After the proposed method is added, the small target precision of RetinaNet rises by 2% and the small target detection precision of the FCOS model rises by 0.5%, showing that the method achieves better feature fusion and feature detection accuracy than the original models.
In summary, the feature fusion module and the receptive field adaptive selection module provided by the invention can effectively improve the adaptability of the model, improve the effect after feature fusion and detection, encode more small target information into the feature map in the feature fusion stage, decode more information in the feature detection stage, namely predict more small target related results, improve the accuracy of the detection model on small target detection, and realize more excellent small target detection effect.
The structure of the small target detection model in this embodiment will be described below. Fig. 4 is a schematic structural diagram of a small target detection model according to another embodiment of the present application, and as shown in fig. 4, the small target detection model includes: a backbone network (backbone) module, a feature fusion module and a detection head module. The Loss functions are Focal Loss and L1Loss.
The main network is used for extracting features of different scales to obtain feature graphs of multiple scales;
the feature pyramid network is used for realizing feature fusion based on multi-scale learning to obtain fused features;
and the detection head is used for predicting the category of the target and its corresponding position.
The backbone network adopts ResNet50, which has 4 stages; the resulting feature map of each stage is reduced to 1/4, 1/8, 1/16 and 1/32 of the original image, respectively.
Specifically, in this embodiment, the image to be detected is input into the backbone network, 4 feature maps can be obtained as feature maps to be detected of the image to be detected, and the obtained 4 feature maps are respectively denoted as C2, C3, C4, and C5.
The feature fusion module is implemented based on a feature pyramid network (Feature Pyramid Network, FPN).
The feature fusion method comprises the following steps:
and (3) aggregating adjacent feature layers according to a formula (1), upsampling the upper-layer feature map through linear interpolation, adding attention to the upsampled upper-layer feature map according to a formula (2) to obtain an attention-added upper-layer feature map, adding and fusing the obtained attention-added upper-layer feature map with the channel-matched own-layer feature map, and obtaining a fused feature map after adding attention through the addition of the addition and fusion result.
It should be noted that, here, the upsampling may be bilinear interpolation or nearest-neighbor interpolation, and the upsampling method is not specifically limited in this embodiment.
Specifically, for the feature to be detected of the image to be detected output by the backbone network, fig. 5 is a feature fusion flowchart in another embodiment of the present application, where the triangle part is the attention module.
As shown in fig. 5, the process of feature fusion includes:
A1, the feature maps C3, C4 and C5 extracted by the backbone network are each passed through a 1×1 convolution to adjust their channel numbers to a common value C, giving new feature maps T3, T4 and T5;
A2, bilinear interpolation is applied to the T5 feature map; the interpolated feature map is passed through the attention module to filter out useless semantic information, and the filtered feature map is then added to and fused with T4 to obtain a new feature map NT4.
Fig. 6 is a schematic diagram of a processing flow of the attention module to the input feature map according to another embodiment of the present application, where, as shown in fig. 6, the processing flow of the attention module includes:
the channel number of the feature map whose irrelevant information is to be filtered is first reduced to 64 using a 1×1 convolution, i.e. the feature map is convolved with 64 kernels of size 1×1, giving a feature map of shape H×W×64;
the convolved result is then normalized;
the channel number of the normalized result is reduced to 1 by another 1×1 convolution, giving an output feature map of shape H×W×1;
the result is converted to values between 0 and 1 by a Sigmoid, forming a position probability map;
the obtained position probability map is multiplied with the input feature map to obtain the result feature map with useless information filtered out, which is then output.
A3, similarly, NT4 undergoes the same operations as in step A2 and is fused with T3 to obtain the new feature map NT3.
A4, the obtained T5, NT4 and NT3 feature maps are first smoothed by a 3×3 convolution, and the convolved feature maps are then filtered a second time by the attention module to remove irrelevant noise information unsuited to the target scale detected at this layer, yielding P3, P4 and P5;
A5, the P5 feature map is convolved with a 3×3 convolution of stride 2 to obtain the P6 feature map;
A6, the P6 feature map is activated by a ReLU nonlinear function and then passed through a 3×3 convolution with stride 2 to obtain the P7 feature map;
A7, the P3, P4, P5, P6 and P7 feature maps serve as the fused feature maps after attention is added.
In the present embodiment, the five feature maps P3, P4, P5, P6 and P7 are input into the detection head module; the class probability of each reference box on the feature map is obtained through the classification unit, and the corresponding offsets are obtained through the detection-box regression unit.
The detection head module comprises a classification unit and a detection-box regression unit. The classification unit predicts, at each position, the class probability of each reference box (Anchor) over the K data set classes; this is equivalent to predicting the probability of each Anchor for every class and then taking the maximum. The detection-box regression unit predicts, at each position, the offsets between each reference box and the ground-truth box (the offset has 4 components, so the final output is 4A). The classification unit consists of 2 convolutions of 3×3 (with ReLU activation, 256 channels) and two receptive field adaptive selection modules RFASM, followed by one 3×3 convolution (without ReLU) with KA output channels; a final sigmoid activation gives the probability of each anchor predicting each class, so each position amounts to KA classification problems. The detection-box regression unit is similar to the classification unit, comprising 4 convolutions of 3×3 (with ReLU activation, 256 channels) followed by one 3×3 convolution (without ReLU), except that the final number of output channels is 4A, which also shows that detection-box regression is class-independent.
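An illustrative sketch of these two units is given below; K (number of classes), A (anchors per position) and the 256-channel width follow the description above, while the class name and the pluggable `rfasm` factory (which can be set to the RFASM module sketched earlier) are assumptions:

```python
import torch.nn as nn

def conv_relu(channels: int):
    return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))

class DetectionHead(nn.Module):
    """Classification unit (K*A output channels) and box-regression unit (4*A output channels)."""
    def __init__(self, num_classes: int, num_anchors: int = 9, channels: int = 256, rfasm=None):
        super().__init__()
        rfasm = rfasm or (lambda: conv_relu(channels))  # e.g. rfasm=lambda: RFASM(channels)
        # formula (3): Rconv3x3 -> RFASM -> RFASM -> Rconv3x3 -> conv3x3
        self.cls_tower = nn.Sequential(conv_relu(channels), rfasm(), rfasm(), conv_relu(channels))
        self.cls_out = nn.Conv2d(channels, num_classes * num_anchors, 3, padding=1)  # sigmoid applied later
        # formula (4): four Rconv3x3 blocks followed by conv3x3
        self.reg_tower = nn.Sequential(*[conv_relu(channels) for _ in range(4)])
        self.reg_out = nn.Conv2d(channels, 4 * num_anchors, 3, padding=1)            # class-agnostic offsets

    def forward(self, x):
        return self.cls_out(self.cls_tower(x)), self.reg_out(self.reg_tower(x))
```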
Specifically, fig. 7 is a block diagram of a receptive field adaptive selection module according to another embodiment of the application, and as shown in fig. 7, the receptive field adaptive fusion process is as follows:
b1, carrying out 3×3 convolution on the input feature map x to obtain a feature map m1;
b2, performing 1×1 convolution on the input feature map x to reduce the dimension to 64 channels, and using 1×3 convolution to expand the receptive field of the longitudinal rectangle to obtain an output feature map m2;
b3, performing 1X 1 convolution on the input feature map x to reduce the dimension to 64 channels, and using 3X 1 convolution to expand the receptive field of the transverse rectangle to obtain an output feature map m3;
b4, carrying out 1X 1 convolution on the input feature map x to reduce the dimension to 64 channels, and stacking two 1X 3 convolutions to expand the receptive field of the larger-scale longitudinal rectangle to obtain an output feature map m4;
b5, carrying out 1X 1 convolution on the input feature map x to reduce the dimension to 64 channels, and stacking two 3X 1 convolutions to expand the receptive field of the larger-scale transverse rectangle to obtain an output feature map m5;
and B6, adding and fusing the feature graphs m2, m3, m4 and m5, and then carrying out 1×1 convolution on the fused feature graphs to reduce the dimension to 4 channels to obtain a weight of 4×h×w, namely, one channel corresponds to one feature graph, and each channel is w2, w3, w4 and w5 respectively.
B7, for the 4×h×w weight, normalization and weighted summation over the 4 branches (m2, m3, m4, m5) are performed at each position [i, j] according to formula (6) (ε takes 0.0001 to prevent the denominator from being 0).
And B8, carrying out 1X 1 convolution dimension ascending on the normalized and summed feature map m, and then adding and fusing the feature map m with m1 to obtain an output feature map.
Overlapping detection boxes are then removed by post-processing; in this embodiment a non-maximum suppression (NMS) algorithm is used. For the predictions of each feature-map level, the top-1K prediction results (i.e. the Anchor class probabilities and corresponding offsets mentioned above, ranked by each reference box's maximum class probability) are taken first, and unqualified results are then filtered out with a threshold of 0.05, which greatly reduces the number of prediction results; only the remaining prediction boxes are decoded, rather than all of the model's predictions, which improves inference speed.
Finally, the prediction results of all feature-map levels are combined, and the final inference result is obtained by filtering overlapping boxes with an NMS algorithm at IoU = 0.5.
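A sketch of this post-processing using torchvision's NMS operator; interpreting "top 1K" as keeping the 1000 highest-scoring candidates per level is an assumption:

```python
import torch
from torchvision.ops import nms

def postprocess(boxes: torch.Tensor, scores: torch.Tensor,
                score_thresh: float = 0.05, topk: int = 1000, iou_thresh: float = 0.5):
    """boxes: (N, 4) decoded prediction boxes; scores: (N,) maximum class probability per box."""
    keep = scores > score_thresh                 # filter unqualified predictions
    boxes, scores = boxes[keep], scores[keep]
    if scores.numel() > topk:                    # keep only the highest-scoring candidates
        scores, idx = scores.topk(topk)
        boxes = boxes[idx]
    keep = nms(boxes, scores, iou_thresh)        # remove overlapping detection boxes
    return boxes[keep], scores[keep]
```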
Bottom-layer features carry less semantic information but locate targets accurately. During FPN feature fusion, the invention uses the attention module to filter out semantic information irrelevant to the detection task of the current layer, so that the five feature maps P3, P4, P5, P6 and P7 each focus more on the target scales that layer is responsible for detecting. Since feature fusion mainly affects the two feature maps P3 and P4, which are exactly the levels responsible for detecting small targets, filtering out irrelevant semantic information greatly improves the algorithm model's small target detection accuracy.
To verify the technical effect of the method of this embodiment, a visual analysis of the fused features was carried out on pictures from the COCO test set, using the trained model for target detection. Fig. 8 is an attention visualization of the conventional RetinaNet model after feature fusion at the first FPN layer, and fig. 9 is an attention visualization of the small object detection model of another embodiment of the present application after feature fusion at the first FPN layer. In these figures, grayish-white areas indicate large values and high attention at the corresponding positions of the feature map; conversely, grayish-black areas indicate small values and low attention. In fig. 8, almost all regions are grayish-white with only a few grayish-black areas, so the contrast is low and the background cannot be effectively distinguished from target regions; that is, semantic information irrelevant to this layer's detection task has been added. Fig. 9 shows the fusion result after the useless information is filtered by the attention model: much of the background is close to grayish-black while the regions containing targets are grayish-white, showing that most of the irrelevant semantic information has been successfully filtered out. This reduces the difficulty for the detection heads and can greatly improve small target detection accuracy.
In summary, the semantic features of the position of the small target are screened out by the attention module from the feature map after the linear interpolation, so that irrelevant middle/large target semantic features are removed, the purpose that only the small target semantic features are fused with the feature map of the small target layer is achieved, the accuracy of the detection model on the small target detection is further improved greatly, and the method can achieve a better small target detection effect.
Example IV
A second aspect of the present application provides a small target detection system, fig. 10 is a schematic structural diagram of a small target detection system in an embodiment of the present application, please refer to fig. 10, where the system includes a video acquisition subsystem 10 and a small target detection subsystem 20;
the video acquisition subsystem 10 is connected with the small target detection subsystem 20 and is used for acquiring video images of a target area through video acquisition equipment and transmitting the video images of the target area to the small target detection subsystem 20;
the small target detection subsystem 20 is configured to receive the video image and perform real-time small target detection on the video image by using the small target detection method based on the attention adaptive fusion feature according to the first embodiment.
In this embodiment, the video capturing device may be a monitoring camera installed in the target monitoring area; the monitoring camera films the monitoring area in real time. The camera may be, but is not limited to, a network monitoring camera: the camera is first placed at a position from which it can film the target area, and it is then accessed over the network, via local transmission, or in other ways to read the video images.
The small target detection method based on the attention self-adaptive fusion characteristic in the first embodiment is adopted to detect the small target, so that the accuracy of the small target detection is improved.
Example five
A third aspect of the present application provides a computer device comprising: the apparatus comprises a memory and a processor, wherein the memory stores a small object detection program based on the attention adaptive fusion feature, and the small object detection program, when executed by the processor, causes the processor to execute the steps of the small object detection method based on the attention adaptive fusion feature according to any one of the above embodiments.
FIG. 11 is a schematic architecture diagram of a computer device in one embodiment of the present application.
The computer device shown in fig. 11 may include: at least one processor 101, at least one memory 102, at least one network interface 104, and other user interfaces 103. The various components in the computer device are coupled together by a bus system 105, which is used to enable connection and communication between these components. In addition to a data bus, the bus system 105 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled as bus system 105 in fig. 11.
The user interface 103 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, a trackball, or a touch pad).
It will be appreciated that the memory 102 in this embodiment may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The memory 102 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory 102 stores the following elements, executable units or data structures, or a subset thereof, or an extended set thereof: an operating system 1021, and application programs 1022.
The operating system 1021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. Applications 1022 include various applications for implementing various application services. A program for implementing the method of the embodiment of the present invention may be included in the application program 1022.
In an embodiment of the present invention, the processor 101 is configured to execute the method steps provided in the first aspect by calling a program or an instruction stored in the memory 102, specifically, a program or an instruction stored in the application 1022.
The method disclosed in the above embodiments of the present invention may be applied to the processor 101 or implemented by the processor 101. The processor 101 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 101 or by instructions in the form of software. The processor 101 may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be executed directly by a hardware decoding processor, or by a combination of the hardware and software units in a decoding processor. The software units may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 102; the processor 101 reads the information in the memory 102 and completes the steps of the above method in combination with its hardware.
In addition, in combination with the small target detection method based on attention adaptive fusion features in the above embodiments, an embodiment of the present invention provides a computer-readable storage medium on which a small target detection program based on attention adaptive fusion features is stored; when the program is executed by a processor, the processor performs the steps of any one of the small target detection methods based on attention adaptive fusion features in the above method embodiments.
It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. The use of the terms first, second, third, etc. is for convenience of description only and does not denote any order; these terms may be understood as part of the component names.
Furthermore, it should be noted that in the description of this specification, reference to the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, provided that they do not contradict each other.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art upon learning the basic inventive concepts. Therefore, the appended claims should be construed to include preferred embodiments and all such variations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, the present invention should also include such modifications and variations provided that they come within the scope of the following claims and their equivalents.

Claims (10)

1. A small target detection method based on attention adaptive fusion features, characterized by comprising the following steps:
adjusting, by using an attention mechanism, the fusion proportion of the upper-layer feature map in two adjacent layers of feature maps to obtain a fusion feature map;
and carrying out small target detection based on the fusion feature map to obtain a small target detection result.
2. The small target detection method based on the attention adaptive fusion feature according to claim 1, characterized in that the method comprises the following steps:
s1, acquiring an image to be detected;
S2, inputting the image to be detected into a pre-trained small target detection model to obtain a corresponding small target detection result; the small target detection model comprises a backbone network module for extracting multi-scale feature maps, a feature fusion module for performing feature fusion on the multi-scale feature maps by using an attention mechanism to adjust the fusion proportion of the upper-layer feature map in two adjacent layers of feature maps, and a detection head module for performing small target detection on the fused feature map.
3. The method for small object detection based on attention adaptive fusion features according to claim 2, wherein the step of extracting the multi-scale feature map by the backbone network module comprises:
performing, on the image to be detected, a 7×7 convolution with a stride of 2 and a 2×2 max pooling with a stride of 2, and then passing the result through stages with different numbers of repeatedly stacked residual blocks to obtain a C2 feature map, a C3 feature map, a C4 feature map and a C5 feature map whose sizes are respectively 1/4, 1/8, 1/16 and 1/32 of the original image, wherein the residual blocks consist of 1×1 and 3×3 convolutions.
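A minimal sketch of the backbone step recited above, assuming a torchvision ResNet-50 as the residual network; note that torchvision's stem uses a 3×3 max pooling window rather than the 2×2 window recited here and its residual blocks are 1×1/3×3/1×1 bottlenecks, so this is only an approximate illustration.

import torch
from torchvision.models import resnet50

class Backbone(torch.nn.Module):
    """7x7 stride-2 convolution and stride-2 max pooling, followed by four
    stacks of residual blocks producing C2 (1/4), C3 (1/8), C4 (1/16)
    and C5 (1/32) of the input resolution."""

    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)  # assumed backbone choice
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4

    def forward(self, x):
        x = self.stem(x)
        c2 = self.layer1(x)   # 1/4 of the input size
        c3 = self.layer2(c2)  # 1/8
        c4 = self.layer3(c3)  # 1/16
        c5 = self.layer4(c4)  # 1/32
        return c2, c3, c4, c5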
4. The small target detection method based on attention adaptive fusion features according to claim 2, wherein the step of the feature fusion module using an attention mechanism to adjust the fusion proportion of the upper-layer feature map in two adjacent layers of the multi-scale feature maps comprises:
The feature maps of adjacent layers are aggregated after the fusion proportion of the upper layer feature maps is adjusted by using an attention mechanism according to the following formula:
P_i = conv_3×3(conv_1×1(C_i) + f_att(f_upsample(T_{i+1})))

wherein P_i is the fusion feature map after attention is added, conv_3×3 denotes convolution processing with a 3×3 convolution kernel, conv_1×1 denotes convolution processing with a 1×1 convolution kernel for channel number matching, f_upsample denotes upsampling, C_i denotes the feature map of the current layer, T_{i+1} denotes the feature map of the upper layer, and f_att denotes adding attention to the input feature map.
5. The small object detection method based on attention adaptive fusion features according to claim 4, wherein attention is added to the inputted feature map according to the following formula:
f_att(x_in) = x_in * sigmoid(conv_1×1(conv_1×1(x_in)))

wherein x_in denotes the input feature map, conv_1×1 denotes convolution with a 1×1 convolution kernel, and sigmoid denotes the activation function.
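Read together, claims 4 and 5 describe one fusion step: a 1×1 lateral convolution on the current-level map, upsampling of the upper-level map, spatial attention on the upsampled map, addition, and a 3×3 smoothing convolution. The sketch below is one possible PyTorch reading; the channel width of 256, the bilinear upsampling mode and the attention channel layout are assumptions.

import torch
import torch.nn.functional as F
from torch import nn

class AttentionAdaptiveFusion(nn.Module):
    """One attention-adaptive fusion step over two adjacent feature levels."""

    def __init__(self, c_channels: int, channels: int = 256):
        super().__init__()
        self.lateral = nn.Conv2d(c_channels, channels, 1)   # conv1x1 for channel matching
        self.att1 = nn.Conv2d(channels, channels, 1)
        self.att2 = nn.Conv2d(channels, channels, 1)
        self.smooth = nn.Conv2d(channels, channels, 3, padding=1)

    def f_att(self, x: torch.Tensor) -> torch.Tensor:
        # f_att(x) = x * sigmoid(conv1x1(conv1x1(x)))
        return x * torch.sigmoid(self.att2(self.att1(x)))

    def forward(self, c_i: torch.Tensor, t_upper: torch.Tensor) -> torch.Tensor:
        lateral = self.lateral(c_i)
        # f_upsample: linear interpolation of the upper-level map to the current size.
        t = F.interpolate(t_upper, size=lateral.shape[-2:],
                          mode="bilinear", align_corners=False)
        # P_i = conv3x3(conv1x1(C_i) + f_att(f_upsample(T_{i+1})))
        return self.smooth(lateral + self.f_att(t))

For example, the fused map of a lower level would be obtained as fusion(C_i, P_{i+1}), so that attention filters the upsampled upper-level semantics before they reach the small target layer.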
6. The small target detection method based on attention adaptive fusion features according to claim 2, wherein the step of the detection head module performing small target detection on the fused feature map comprises:
performing classification detection on the fusion feature map according to the following formula to obtain the class probability of each prediction bounding box at each spatial position, wherein during detection the receptive field adaptive selection module expands the square receptive field into square and rectangular receptive fields:
f_output1 = conv_3×3(Rconv_3×3(RFASM(RFASM(Rconv_3×3(x)))))

and performing regression detection on the fusion feature map according to the following formula to obtain the offset of each reference anchor box at each spatial position so as to determine the position of the prediction bounding box:

f_output2 = conv_3×3(Rconv_3×3(Rconv_3×3(Rconv_3×3(Rconv_3×3(x)))))

wherein f_output1 denotes the prediction result of classification detection, f_output2 denotes the prediction result of regression detection, x denotes the input feature map, conv_3×3 denotes convolution with a 3×3 convolution kernel, Rconv_3×3 denotes a 3×3 convolution followed by a ReLU activation function, and RFASM denotes the receptive field adaptive selection module.
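Under one reading of claim 6, the classification branch replaces the two middle 3×3 convolutions of a RetinaNet-style head with RFASM blocks, while the regression branch keeps five stacked 3×3 convolutions. The sketch below follows that reading; the channel width, anchor count and class count are assumptions, and make_rfasm is a hypothetical factory returning a receptive field adaptive selection module (see the sketch after claim 7).

import torch
from torch import nn

def rconv(channels: int) -> nn.Sequential:
    """Rconv_3x3: a 3x3 convolution followed by a ReLU activation."""
    return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())

class DetectionHead(nn.Module):
    """Classification and regression branches of the detection head."""

    def __init__(self, make_rfasm, channels: int = 256,
                 num_anchors: int = 9, num_classes: int = 80):
        super().__init__()
        # f_output1 = conv3x3(Rconv3x3(RFASM(RFASM(Rconv3x3(x)))))
        self.cls_branch = nn.Sequential(
            rconv(channels), make_rfasm(channels), make_rfasm(channels),
            rconv(channels),
            nn.Conv2d(channels, num_anchors * num_classes, 3, padding=1))
        # f_output2 = conv3x3(Rconv3x3(Rconv3x3(Rconv3x3(Rconv3x3(x)))))
        self.reg_branch = nn.Sequential(
            rconv(channels), rconv(channels), rconv(channels), rconv(channels),
            nn.Conv2d(channels, num_anchors * 4, 3, padding=1))

    def forward(self, x: torch.Tensor):
        return self.cls_branch(x), self.reg_branch(x)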
7. The small target detection method based on attention adaptive fusion features according to claim 6, wherein the receptive field adaptive selection module expands the square receptive field into square and rectangular receptive fields according to the following formulas:
f_next = conv_1×1(m) + m_1

wherein f_next denotes the output result of the receptive field adaptive selection module, and m is obtained by normalized weighting of m_2, m_3, m_4 and m_5:

m = (w_2 * m_2 + w_3 * m_3 + w_4 * m_4 + w_5 * m_5) / (w_2 + w_3 + w_4 + w_5 + ε)

wherein m_1, m_2, m_3, m_4 and m_5 are calculated according to the following formulas:

m_1 = conv_3×3(x)
m_2 = conv_1×3(conv_1×1(x))
m_3 = conv_3×1(conv_1×1(x))
m_4 = conv_1×3(conv_1×3(conv_1×1(x)))
m_5 = conv_3×1(conv_3×1(conv_1×1(x)))

w_2, w_3, w_4 and w_5 are obtained by channel splitting of w, which is calculated according to the following formula:

w = conv_1×1(m_2 + m_3 + m_4 + m_5)

wherein x denotes the input feature map, conv_1×3 denotes convolution with a 1×3 convolution kernel, conv_3×1 denotes convolution with a 3×1 convolution kernel, and ε takes a value of 0.0001.
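A sketch of the receptive field adaptive selection module under the reading of claim 7 reconstructed above: a 3×3 branch m_1, four rectangular-kernel branches m_2 to m_5, weights obtained from a 1×1 convolution and channel splitting, normalized weighting, a 1×1 output convolution and a residual addition of m_1. Sharing a single 1×1 reduction convolution among m_2 to m_5, producing a four-channel weight map, and the exact form of the normalization are assumptions.

import torch
from torch import nn

class RFASM(nn.Module):
    """Receptive field adaptive selection module (approximate sketch)."""

    def __init__(self, channels: int, eps: float = 1e-4):
        super().__init__()
        self.eps = eps                                          # epsilon = 0.0001
        self.reduce = nn.Conv2d(channels, channels, 1)          # shared conv1x1(x), an assumption
        self.b1 = nn.Conv2d(channels, channels, 3, padding=1)   # m1 = conv3x3(x)
        self.b2 = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))   # m2
        self.b3 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))   # m3
        self.b4 = nn.Sequential(
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)))        # m4
        self.b5 = nn.Sequential(
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)),
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)))        # m5
        self.weight = nn.Conv2d(channels, 4, 1)                 # w, split into w2..w5
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m1 = self.b1(x)
        r = self.reduce(x)
        m2, m3, m4, m5 = self.b2(r), self.b3(r), self.b4(r), self.b5(r)
        w = self.weight(m2 + m3 + m4 + m5)          # w = conv1x1(m2 + m3 + m4 + m5)
        w2, w3, w4, w5 = torch.split(w, 1, dim=1)   # channel splitting
        # Normalized weighting of the rectangular branches; the claim does not
        # constrain the weights to be non-negative, so a ReLU on w may be
        # needed in practice to keep the denominator well behaved.
        m = (w2 * m2 + w3 * m3 + w4 * m4 + w5 * m5) / (w2 + w3 + w4 + w5 + self.eps)
        return self.out(m) + m1                     # f_next = conv1x1(m) + m1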
8. The small target detection method based on attention adaptive fusion features according to claim 2, further comprising, after S1 and before S2:
preprocessing the image to be detected to obtain a preprocessed image to be detected, wherein the preprocessing comprises one or more of padding, resizing the image, and data augmentation.
9. The small target detection method based on attention adaptive fusion features according to claim 2, further comprising, after S2:
S3, deleting overlapping detection boxes through post-processing to obtain a final small target detection result.
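The post-processing in step S3 is conventionally realized with non-maximum suppression; below is a minimal sketch using torchvision, where the IoU threshold of 0.5 is an assumed value rather than one recited in the claims.

import torch
from torchvision.ops import nms

def postprocess(boxes: torch.Tensor, scores: torch.Tensor,
                iou_threshold: float = 0.5) -> torch.Tensor:
    """Delete overlapping detection boxes and return the indices of the
    boxes that are kept. boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,)."""
    return nms(boxes, scores, iou_threshold)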
10. The small target detection system is characterized by comprising a video acquisition subsystem and a small target detection subsystem;
the video acquisition subsystem is connected with the small target detection subsystem and is used for acquiring video images of a target area through video acquisition equipment and sending the video images of the target area to the small target detection subsystem;
the small target detection subsystem is configured to receive the video images and perform real-time small target detection on the video images by using the small target detection method based on attention adaptive fusion features according to any one of claims 1 to 9.
CN202310115681.5A 2023-02-14 2023-02-14 Small target detection method and small target detection system Pending CN116188929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310115681.5A CN116188929A (en) 2023-02-14 2023-02-14 Small target detection method and small target detection system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310115681.5A CN116188929A (en) 2023-02-14 2023-02-14 Small target detection method and small target detection system

Publications (1)

Publication Number Publication Date
CN116188929A true CN116188929A (en) 2023-05-30

Family

ID=86432273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310115681.5A Pending CN116188929A (en) 2023-02-14 2023-02-14 Small target detection method and small target detection system

Country Status (1)

Country Link
CN (1) CN116188929A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116645523A (en) * 2023-07-24 2023-08-25 济南大学 Rapid target detection method based on improved RetinaNet
CN116645523B (en) * 2023-07-24 2023-12-01 江西蓝瑞存储科技有限公司 Rapid target detection method based on improved RetinaNet
CN116703928A (en) * 2023-08-08 2023-09-05 宁德市天铭新能源汽车配件有限公司 Automobile part production detection method and system based on machine learning
CN116703928B (en) * 2023-08-08 2023-10-27 宁德市天铭新能源汽车配件有限公司 Automobile part production detection method and system based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination