CN117788791A - Open-air small target detection method based on multiscale fusion - Google Patents

Open-air small target detection method based on multiscale fusion

Info

Publication number
CN117788791A
Authority
CN
China
Prior art keywords
scale
features
small
target detection
visible light
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311729189.3A
Other languages
Chinese (zh)
Inventor
罗珊珊
刘建坡
孟方舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Microsystem and Information Technology of CAS
Original Assignee
Shanghai Institute of Microsystem and Information Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Microsystem and Information Technology of CAS filed Critical Shanghai Institute of Microsystem and Information Technology of CAS
Priority to CN202311729189.3A
Publication of CN117788791A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a field small target detection method based on multi-scale fusion, which comprises the following steps: receiving a field visible light image, and performing contrast enhancement processing on the field visible light image; inputting the contrast-enhanced field visible light image into a multi-scale fusion small target detection model to obtain a target detection result. The multi-scale fusion small target detection model comprises: a backbone network part, which extracts multi-scale features and depth feature information containing local and global features from the contrast-enhanced field visible light image; a neck network part, which performs multi-scale fusion on the extracted multi-scale features and the depth feature information containing local and global features, retaining small-target features during fusion; and a head network part, which predicts the target class and coordinate position from the multi-scale fused feature information to obtain the target detection result. The invention can effectively improve the anti-interference capability of small target detection.

Description

Open-air small target detection method based on multiscale fusion
Technical Field
The invention relates to the technical field of computer vision, in particular to a field small target detection method based on multi-scale fusion.
Background
Target detection is one of the most common challenges in computer vision. It underpins higher-level visual tasks such as image segmentation, target tracking, image description, event detection and scene understanding, is an important prerequisite in many high-definition monitoring tasks, and has broad application prospects in fields ranging from the military to daily life. Small target detection has long been a difficulty within target detection, because the target carries little information, demands high positioning accuracy, and is easily submerged by environmental noise.
In field remote monitoring, small targets whose size lies in the range [10×10, 32×32] pixels must be detected accurately. The difficulty is twofold: the targets carry very little information, and pronounced illumination changes make the captured images too dark or too bright, which together make field small target detection very challenging.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a field small target detection method based on multi-scale fusion, which can effectively improve the anti-interference capability of small target detection.
The technical scheme adopted for solving the technical problems is as follows: the field small target detection method based on multi-scale fusion comprises the following steps:
receiving a field visible light image, and performing contrast enhancement processing on the field visible light image;
inputting the field visible light image subjected to contrast enhancement processing into a multi-scale fusion small target detection model to obtain a target detection result; the multi-scale fusion small target detection model is constructed based on a YOLOv5 model and comprises the following components:
the backbone network part is used for extracting multi-scale features and depth feature information containing local features and global features from the field visible light image subjected to the contrast enhancement processing;
the neck network part is used for carrying out multi-scale fusion on the extracted multi-scale features and depth feature information containing local features and global features, and retaining the features of the small targets during fusion;
and the head network part is used for predicting the target class and coordinate position according to the feature information after the multi-scale fusion to obtain a target detection result.
The contrast enhancement processing performed on the field visible light image is specifically: improving the local contrast between the small target and the background environment in the field visible light image by adopting the CLAHE algorithm.
The backbone network part comprises a Focus structure, a CSP structure and an SPPF structure arranged in sequence; the CSP structure is used for extracting multi-scale features of the contrast-enhanced field visible light image; a shortcut branch module is connected in parallel with the SPPF structure, and the outputs of the max-pooling branches in the SPPF structure and of the shortcut branch module are merged to obtain depth feature information containing the local and global features of the contrast-enhanced field visible light image.
The neck network part adopts a BiFPN+PAN structure and fuses the multi-scale features output by the second, third, fourth and fifth convolution layers of the CSP structure of the backbone network part with the depth feature information containing local and global features.
The BiFPN in the neck network part learns a weight on each scale for fusing that scale with the features of other scales. The weights are assigned as

O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} I_i

where w_i is the weight of the i-th scale feature, w_j is the weight learned on the j-th scale, I_i is the feature of the i-th scale, O is the fused feature, and ε is a small value.
A CA attention module is arranged on the shallow features in the neck network part and is used for decomposing channel attention into two one-dimensional feature encodings that aggregate features along different directions; the resulting feature maps are encoded separately into a pair of direction-aware and position-sensitive feature maps, so as to enhance the attention paid to small-target features among the shallow features.
Advantageous effects
Due to the adoption of the above technical scheme, the invention has the following advantages and positive effects compared with the prior art: the invention extracts local and global features simultaneously during feature extraction to acquire more feature information, and improves the neck network part so that small-target features are retained as far as possible; this mitigates the impact of the limited information carried by small targets and of light-dark interference on field small target detection, enabling the system to predict more accurate results.
Drawings
FIG. 1 is a flow chart of a field small target detection method based on multi-scale fusion according to an embodiment of the invention;
FIG. 2 is a block diagram of a multi-scale fusion small target detection model in an embodiment of the invention;
FIG. 3 is a flow chart of an embodiment of the present invention;
FIG. 4 is a schematic diagram of an SPPF configuration in accordance with an embodiment of the present invention;
FIG. 5 is a diagram showing the detection result according to the embodiment of the present invention.
Detailed Description
The invention will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present invention and are not intended to limit the scope of the present invention. Further, it is understood that various changes and modifications may be made by those skilled in the art after reading the teachings of the present invention, and such equivalents are intended to fall within the scope of the claims appended hereto.
The embodiment of the invention relates to a field small target detection method based on multi-scale fusion, which is shown in fig. 1 and comprises the following steps:
step 1, receiving a field visible light image, and performing contrast enhancement processing on the field visible light image. The contrast enhancement processing in this embodiment means that the local contrast between the small object and the background environment in the field visible light image is improved by using the CLAHE algorithm, so as to highlight the difference between the small object and the background.
And 2, inputting the field visible light image subjected to contrast enhancement treatment into a multi-scale fusion small target detection model to obtain a target detection result.
As shown in fig. 2, the multi-scale fusion small target detection model in the present embodiment is constructed based on a YOLOv5 model, and includes a backbone network part, a neck network part, and a head network part.
The backbone network part is used for extracting multi-scale features and depth feature information containing local features and global features from the field visible light image subjected to the contrast enhancement processing.
The backbone network part of this embodiment is sequentially provided with a Focus structure, a CSP structure, and an SPPF structure. The CSP structure is used for extracting multi-scale characteristics of the field visible light image after the contrast enhancement treatment. The SPPF structure is used for merging the features to obtain depth feature information containing local features and global features.
The neck network part is used for carrying out multi-scale fusion on the extracted multi-scale features and depth feature information containing local features and global features, and retaining the features of the small targets during fusion.
In this embodiment, small targets whose size lies in the range [10×10, 32×32] pixels must be detected accurately. YOLOv5 originally uses only the outputs of the third, fourth and fifth convolution layers of the backbone network as inputs to the neck network; at the third layer's downsampling stride of 8, however, a target 10 pixels wide occupies barely more than one pixel on the feature map, so essentially no feature information remains (see the worked example below). The neck network portion of this embodiment therefore takes the output of the second convolution layer of the CSP structure in the backbone network as an additional input, alongside the outputs of the third, fourth and fifth convolution layers. Because feature layers of different scales contribute differently to the output, the neck network part learns a weight on each scale for fusing that scale's features with those of the other scales, fusing the model's shallow and deep feature maps. This embodiment also adds a CA attention module to the neck network part: applied to the shallow feature map, it strengthens the attention the shallow features pay to small-target features.
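The following worked example shows why the P2 output matters, assuming a 640×640 input and the standard YOLOv5 strides of 4, 8, 16 and 32 for the P2-P5 levels:

```python
# Feature-map footprint of a 10-pixel target at each backbone scale
# (worked example; assumes a 640x640 input and standard YOLOv5 strides).
input_size, target_px = 640, 10
for level, stride in {"P2": 4, "P3": 8, "P4": 16, "P5": 32}.items():
    fmap = input_size // stride
    footprint = target_px / stride
    print(f"{level}: {fmap}x{fmap} map, 10-px target spans ~{footprint:.2f} px")
# P3 (stride 8) leaves only ~1.25 px for a 10-px target, which is why
# the P2 (stride 4) output is added as an extra neck input.
```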
The head network part is used for predicting the target class and coordinate position according to the multi-scale fused feature information to obtain a target detection result. The head network part of this embodiment includes four detection heads, each producing predictions on a different scale.
The invention is further illustrated by a specific example.
As shown in fig. 3, the method of the present embodiment includes a training phase and a testing phase.
Wherein the training phase comprises the following steps:
step 1: and collecting and labeling field visible light images, and constructing a field small target image data set.
The method specifically comprises the following steps:
and collecting image data in the wild, and labeling personnel targets and vehicle targets according to a Yolo data set format.
Preprocessing is performed on the data set, and the data set is divided into three parts, namely a training set, a verification set and a test set.
Local contrast enhancement is carried out on the data by adopting a CLAHE algorithm, and the distinction between a small target and a background is highlighted.
Step 2: and constructing a YOLOv5 target detection network.
The YOLOv5 model built in this step mainly consists of a backbone network, a neck network and a head network. The backbone network mainly comprises a Focus structure, a CSP structure and an SPPF structure and is used for extracting features from the input image. The neck network adopts a BiFPN+PAN structure for fusing the feature maps output by the 3rd, 4th and 5th convolution layers of the backbone network. The head network consists of 4 parallel convolution layers; anchor boxes are applied to the four output feature maps of the neck network to generate the final output vectors containing class, confidence and bounding box.
Step 3: the architecture in the backbone network is modified.
As shown in fig. 4, in this step a shortcut branch module is connected in parallel between the input and output ends of the SPPF structure of the backbone network: the features entering the SPPF structure pass through the max-pooling branches to give a first output, the shortcut branch module gives a second output, and the two outputs are merged into a fused feature map containing both local and global features.
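A minimal PyTorch sketch of such a modified SPPF block follows; the channel widths and the exact merge point of the shortcut branch are assumptions, as the patent does not specify them.

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Conv2d + BatchNorm + SiLU, the basic block used throughout YOLOv5."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPFWithShortcut(nn.Module):
    """SPPF plus a parallel shortcut branch (sketch, assumed layout).

    The chained 5x5 max-pool outputs capture progressively more global
    context, while the 1x1 shortcut branch preserves local features;
    merging both yields the local-plus-global fused feature map.
    """
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = Conv(c_in, c_mid, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.shortcut = Conv(c_in, c_mid, 1)   # parallel local branch
        self.cv2 = Conv(c_mid * 5, c_out, 1)   # 4 pooled maps + 1 shortcut

    def forward(self, x):
        y0 = self.cv1(x)
        y1 = self.pool(y0)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        s = self.shortcut(x)                   # second output (local features)
        return self.cv2(torch.cat([y0, y1, y2, y3, s], dim=1))
```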
Step 4: a multi-scale feature fusion method in a neck network is modified.
In the step, the output of the 2 nd convolution layer of the CSP structure in the backbone network and the output of the 3 rd convolution layer, the 4 th convolution layer and the 5 th convolution layer of the CSP structure in the original backbone network are taken as the input of the neck network, so that the loss of the small target characteristic information is reduced.
In this step, the neck network of YOLOv5 is improved with a BiFPN structure, which fuses the multi-scale feature maps P2, P3, P4 and P5 output by the 2nd, 3rd, 4th and 5th convolution layers of the CSP structure in the backbone network more simply and more rapidly. Because feature layers of different scales contribute differently to the output, fusion is weighted with per-scale factors learned by the BiFPN, so that the BiFPN learns a weight on each scale for fusing that scale with the features of other scales. The weights are assigned as

O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} I_i

where w_i is the weight of the i-th scale feature, w_j is the weight learned on the j-th scale, I_i is the feature of the i-th scale, O is the fused feature, and ε is a small value set to avoid numerical instability; in this embodiment ε = 0.0001. Each normalized weight lies between 0 and 1.
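A minimal PyTorch sketch of this fast normalized fusion follows, assuming the input features have already been resized to a common resolution and channel count:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN fast normalized fusion (sketch).

    One scalar weight is learned per input scale; ReLU keeps each
    weight non-negative so every normalized weight lies in [0, 1],
    and eps guards against division by a near-zero sum, matching
    O = sum_i (w_i / (eps + sum_j w_j)) * I_i with eps = 0.0001.
    """
    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, features):
        # features: list of tensors already brought to a common scale
        w = torch.relu(self.w)
        w = w / (self.eps + w.sum())
        return sum(wi * fi for wi, fi in zip(w, features))
```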
Step 5: CA attention module is added to the bottom layer feature output part of the neck network.
This step adds a CA attention module to the neck network structure, i.e. places the CA attention module on the shallow features in the neck network part. The CA attention module decomposes channel attention into two one-dimensional feature encodings that aggregate features along different directions; the resulting feature maps are encoded separately into a pair of direction-aware and position-sensitive attention maps, which are applied complementarily to the input feature map to enhance the attention paid to small-target features among the shallow features. Because a small target offers little usable information and the background is extremely complex, the CA attention module is added after the output of the shallow feature map P2 of the neck network in order to focus on the small target; this helps suppress unnecessary shallow feature information from the background, lets the model attend to the relevant region, and reduces the influence of the background and other irrelevant information on detection. A sketch of the module follows.
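A minimal PyTorch sketch of a coordinate attention block is given below, based on the published CA design (Hou et al., CVPR 2021); the reduction ratio is an assumed default.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate Attention (sketch).

    Channel attention is factorized into two 1-D encodings that
    aggregate features along the height and width directions; the
    resulting direction-aware, position-sensitive attention maps are
    applied back onto the input feature map.
    """
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        c_mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, c_mid, 1)
        self.bn = nn.BatchNorm2d(c_mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(c_mid, channels, 1)
        self.conv_w = nn.Conv2d(c_mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                        # encode along width
        x_w = self.pool_w(x).permute(0, 1, 3, 2)    # encode along height
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w
```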
Step 6: training the built model, specifically:
preprocessing the data, and augmenting it by means of flipping, rotation, translation, brightness adjustment, copy-paste and the like;
the parameters are set as follows: initial learning rate lr=0.01, weight decay weight=0.0005, optimization by stochastic gradient descent (SGD) with momentum momentum=0.937, batch size batch_size=120, and training epochs epoch=200;
the training set and the verification set are input into a field small target detection network based on multi-scale fusion, the input image size is 640 multiplied by 640, the training set is used for learning of the network, and the verification set is used for verifying the training effect. The trained model is the multi-scale fusion small target detection model.
The testing phase comprises the following steps:
inputting the test set into a multi-scale fusion small target detection model to obtain an output vector, wherein the content comprises a target category, a confidence coefficient and a bounding box, removing the repeated target frame by a non-maximum suppression NMS method to obtain a final detection result, and outputting the position coordinates of the target in the image, and the target category and the confidence coefficient.
The experimental platform of this embodiment is based on the Linux operating system, with an Nvidia GeForce RTX 3090 GPU and an Intel(R) Xeon(R) CPU E5-2643 v4 @ 3.40GHz. The code runs on the PyTorch framework, with CUDA 12.1 and Python 3.10.
To verify the effectiveness of the method, the method of this embodiment is compared with the detection results of the baseline algorithm YOLOv5, as shown in the following table:

Table 1 Results of comparative experiments with the prior art

Method               mAP50    mAP95
YOLOv5 (baseline)    0.681    0.408
Proposed method      0.701    0.434
As can be seen from the results in Table 1, after introducing the multi-scale fusion method the network's mAP50 rises from 0.681 to 0.701 and its mAP95 from 0.408 to 0.434, which proves that the field small target detection method based on multi-scale fusion provided by this embodiment can effectively improve small target detection accuracy and demonstrates the effectiveness of the method. Fig. 5 shows the detection results of the method of this embodiment.
It is easy to see that the invention extracts local and global features simultaneously to acquire more feature information and improves the neck network part, so that small-target features are retained as far as possible; this mitigates the impact of the limited information carried by small targets and of light-dark interference on field small target detection, enabling the system to predict more accurate results.

Claims (6)

1. The field small target detection method based on multi-scale fusion is characterized by comprising the following steps of:
receiving a field visible light image, and performing contrast enhancement processing on the field visible light image;
inputting the field visible light image subjected to contrast enhancement processing into a multi-scale fusion small target detection model to obtain a target detection result; the multi-scale fusion small target detection model is constructed based on a YOLOv5 model and comprises the following components:
the backbone network part is used for extracting multi-scale features and depth feature information containing local features and global features from the field visible light image subjected to the contrast enhancement processing;
the neck network part is used for carrying out multi-scale fusion on the extracted multi-scale features and depth feature information containing local features and global features, and retaining the features of the small targets during fusion;
and the head network part is used for predicting the target class and coordinate position according to the feature information after the multi-scale fusion to obtain a target detection result.
2. The method for detecting a small field target based on multi-scale fusion according to claim 1, wherein the contrast enhancement processing performed on the field visible light image is specifically: improving the local contrast between the small target and the background environment in the field visible light image by adopting the CLAHE algorithm.
3. The method for detecting a small field target based on multi-scale fusion according to claim 1, wherein the backbone network part comprises a Focus structure, a CSP structure and an SPPF structure arranged in sequence; the CSP structure is used for extracting multi-scale features of the contrast-enhanced field visible light image; a shortcut branch module is connected in parallel with the SPPF structure, and the outputs of the max-pooling branches in the SPPF structure and of the shortcut branch module are merged to obtain depth feature information containing the local and global features of the contrast-enhanced field visible light image.
4. The method for detecting the field small target based on multi-scale fusion according to claim 1, wherein the neck network part adopts a BiFPN+PAN structure, fusing the multi-scale features output by the second, third, fourth and fifth convolution layers of the CSP structure of the backbone network part with the depth feature information containing local and global features.
5. The method for detecting a small field target based on multi-scale fusion according to claim 4, wherein the BiFPN in the neck network part learns a weight on each scale for fusing that scale with the features of other scales, the weights being assigned as

O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} I_i

where w_i is the weight of the i-th scale feature, w_j is the weight learned on the j-th scale, I_i is the feature of the i-th scale, O is the fused feature, and ε is a small value.
6. The method for detecting the small field target based on multi-scale fusion according to claim 4, wherein a CA attention module is arranged on the shallow features in the neck network part; the CA attention module decomposes channel attention into two one-dimensional feature encodings that aggregate features along different directions, and the resulting feature maps are encoded separately into a pair of direction-aware and position-sensitive feature maps, so as to enhance the attention paid to small-target features among the shallow features.
CN202311729189.3A 2023-12-15 2023-12-15 Open-air small target detection method based on multiscale fusion Pending CN117788791A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311729189.3A CN117788791A (en) 2023-12-15 2023-12-15 Open-air small target detection method based on multiscale fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311729189.3A CN117788791A (en) 2023-12-15 2023-12-15 Open-air small target detection method based on multiscale fusion

Publications (1)

Publication Number Publication Date
CN117788791A true CN117788791A (en) 2024-03-29

Family

ID=90382679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311729189.3A Pending CN117788791A (en) 2023-12-15 2023-12-15 Open-air small target detection method based on multiscale fusion

Country Status (1)

Country Link
CN (1) CN117788791A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination