CN115063691B - Feature enhancement-based small target detection method in complex scene - Google Patents
Feature enhancement-based small target detection method in complex scene
- Publication number
- CN115063691B (application CN202210780211.6A)
- Authority
- CN
- China
- Prior art keywords
- feature
- network
- prediction
- targets
- scale
- Prior art date
- Legal status: Active
Classifications
- G06V20/13 — Scenes; Terrestrial scenes; Satellite images
- G06V10/75 — Organisation of the matching processes, e.g. simultaneous or sequential comparisons; coarse-fine / multi-scale approaches; context analysis
- G06V10/776 — Validation; Performance evaluation
- G06V10/806 — Fusion of extracted features (sensor, preprocessing, feature extraction or classification level)
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V2201/07 — Target detection
Abstract
The invention belongs to the field of computer vision and target detection, and particularly relates to a feature-enhancement-based method for detecting small targets in complex scenes. The technical scheme of the invention is as follows: first, a Cutout-DA data enhancement method is proposed, which generates new occlusion data and expands them into the VisDrone2021 dataset; then a multi-scale fused feature-enhancement path aggregation network, MSFE-PANet, is designed, which obtains richer and finer semantic and spatial information features through an integrated attention mechanism, feature fusion, and a network prediction-scale strategy aimed at small targets; a prediction-box rejection loss function, RB_Loss, is designed; and finally the model is trained. The invention enhances the mutual fusion of the strong localization information of shallow feature maps and the strong semantic information of deep feature maps, helps the network find regions of interest in complex scenes, and improves sensitivity to small targets. The RB_Loss rejection loss function and the new network prediction scale address the problems of overlap, missed detection of occluded small targets, and false detection against complex backgrounds.
Description
Technical Field
The invention belongs to the field of computer vision and target detection, and particularly relates to a feature-enhancement-based method for detecting small targets in complex scenes.
Background
In recent years, the rapid development of deep learning has driven remarkable breakthroughs in computer vision and made it an unprecedented research hotspot. The main task of computer vision is to parse images, including image classification, detection and segmentation. Target detection, one of the core research directions in computer vision, uses suitable algorithms to locate targets accurately and identify their classes. Small target detection is both an important application and a difficulty of target detection, and plays an important role in fields such as autonomous driving, intelligent medical care, defect detection and aerial image analysis. Detecting small, distant objects in high-resolution street-scene photographs is a necessary condition for the safe deployment of autonomous vehicles; in medical imaging, finding masses and tumors only a few pixels in size is important for accurate and early diagnosis; automated industrial inspection can also benefit from small target detection by locating small defects visible on material surfaces. In conclusion, small target detection has wide application value and important research significance.
Although target detection algorithms have made major breakthroughs, results on small targets remain far from ideal because of the significant performance gap between detecting small and large targets. Existing small target detection methods cannot be applied well to real complex scenes, mainly for the following reasons. 1. Visual features are not obvious: small targets carry little usable information; at low image resolution a small target may occupy only a few pixels, and accurately detecting it under such weak visual features is a great challenge. 2. Feature extraction: in target detection the quality of the extracted features directly affects the final detection performance, and the features of small targets are harder to extract than those of large-scale targets. Most computer vision architectures use pooling layers, and part of the small-target features are lost after pooling; extracting effective small-target features in deep neural networks remains an open problem. 3. Background interference: small target detection in complex environments suffers from factors such as illumination, complex geographic elements, occlusion and aggregation, making small targets difficult to distinguish from the background or from similar targets; effectively suppressing complex background interference is also a current challenge.
Disclosure of Invention
Aiming at the problems in the prior art that small targets cannot be accurately detected, that their features are difficult to extract, and that detection does not transfer well to real complex scenes, the invention provides a feature-enhancement-based method for detecting small targets in complex scenes.
In order to achieve the above purpose, the technical scheme of the invention is as follows: a feature-enhancement-based method for detecting small targets in complex scenes comprises the following steps:
Step 1, data preparation: the dataset is derived from an aerial image;
step 2, data enhancement: partial data images are randomly selected from the dataset, and the targets in these images (both partially visible and fully visible) are randomly occluded at 0.2, 0.4, 0.6 and 0.8 of the target size, generating new occlusion data that are expanded into the VisDrone2021 dataset;
step 3, designing a multi-scale fused characteristic enhanced path aggregation network MSFE-PANet;
step 3.1: improving network prediction scale
Remove the prediction head YOLO head3 in YOLOv4 that is aimed at detecting large targets, while retaining the corresponding 13 x 13 feature map; at the same time, add to the prediction network a prediction head YOLO head0, generated from the shallow high-resolution 104 x 104 feature map, aimed at detecting small-scale targets, producing a new network prediction-scale structure.
Step 3.2: feature layer fusion
On the new network prediction-scale structure, upsample the feature map extracted by each feature network layer by the corresponding factor, and add and fuse each with the first-layer feature map to obtain new feature maps;
step 3.3: an attention module;
step 4: the prediction block rejection Loss function rb_loss is designed.
Step 5: and training a model.
Step 3.3 above: integrating attention mechanisms in PANet.
Step 3.3.1: a CBAM attention module is added, as shown in equation (1):
F' = M_c(F) ⊗ F,  F'' = M_s(F') ⊗ F'  (1)
where F is the input feature map, M_c and M_s are the channel and spatial attention maps, and ⊗ denotes element-wise multiplication.
The calculation formula of the channel attention is (2):
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg)) + W_1(W_0(F_max)))  (2)
wherein σ is the Sigmoid activation function, and the MLP weights W_0 and W_1 are shared between the two pooled branches.
The calculation formula of the spatial attention is (3):
M_s(F) = σ(f^{7x7}([AvgPool(F); MaxPool(F)]))  (3)
wherein σ is the Sigmoid activation function and f^{7x7} denotes a convolution with a 7 x 7 filter.
Step 3.3.2: improving the channel attention module of the CBAM;
step 3.3.3: introducing an SE-attention module;
step 3.3.4: improving the SPP module;
step 3.3.5: the SE-attention module is optimized.
Step 3.3.2 above, the calculation formula is defined as (4):
M_c(F) = σ(Conv_1x1(AvgPool(F)) + Conv_1x1(MaxPool(F)))  (4)
wherein the 1*1 convolutions replace the fully connected layers of the original channel attention.
step 3.3.3, giving an input X, the number of channels is C 1 Through F tr Is subjected to a series of convolution and pooling operations to obtain a channel number C 2 Is characterized by U; f (F) sq For feature compression operation, feature compression is carried out along the space dimension, and each two-dimensional feature channel is changed into one pixel; followed by F ex Excitation operations, then weighted by multiplication onto previous features
Calculation formula (5):
in (a): u (U) C Representing the C-th channel in the feature map; z is Z C Is the output of the compression operation. The sigma is a Sigmoid activation function; w (W) 1 ,W 2 All are all fully connected operation; delta is the ReLU activation function. Calculation formula (7) S C Is the C-th weight in step S.
S=F ex (Z,W)=σ(g(Z,W))=σ(W 2 δ(W 1 Z)) (6)
F scale =(U C ,S C )=S C ·U C (7)
The step 3.3.4 specifically comprises: the pooling layers with kernel sizes 1, 5, 9 and 13 in the SPP are changed into a 1*1 convolution and 3*3 hole (dilated) convolutions; the improved SPP module does not change the size of the feature map, and the output feature map size is calculated by formula (8):
O = floor((n + 2p - d(k - 1) - 1) / s) + 1  (8)
wherein n is the input size, k the kernel size, s the stride, p the padding and d the dilation rate.
the step 3.3.5 specifically comprises the following steps: an improved SPP module is added in the SE-attention to obtain an SSE-attention module.
The step 4 specifically comprises: the overlap degree IOU between the prior prediction boxes of two overlapping targets is taken as the loss value; the back-propagation network is optimized along the gradient direction, separating the overlapping prior prediction boxes of the two targets. The loss is defined by formula (9), which expresses the matching between the prior prediction boxes of different targets.
Compared with the prior art, the invention has the beneficial effects that:
1. Compared with the Mosaic and CutMix data enhancement methods of the baseline network YOLOv4, the Cutout-DA data enhancement strategy designed here improves mAP on the VisDrone2021 dataset by 3.57 points, which fully demonstrates the effectiveness of the Cutout-DA strategy for small-target detection. In the YOLOv4 prediction network, the output prior prediction boxes must be filtered by NMS, so mutually occluding and overlapping targets interfere with target-box matching and cause many missed and false detections. The RB_Loss proposed by the invention further reduces the mutual influence of occluded targets through the IOU, improving mAP by 2.8 points.
2. The multi-scale fused feature-enhancement path aggregation network MSFE-PANet obtains richer and finer semantic and spatial information features through the small-target prediction-scale strategy and multi-scale feature fusion, improving mAP by a further 9.47 points on top of the Cutout-DA and RB_Loss strategies and greatly improving small-target detection accuracy. Adding the LW-CBAM and SSE-Attention mechanisms further extracts attention regions and helps the network concentrate on useful small objects, improving mAP by 6.63 points and alleviating missed and false detection of overlapping and occluded small targets against complex backgrounds.
3. The invention accurately detects small targets, extracts their features readily, and handles a variety of real complex scenes; it has a wide application range and strong adaptability.
Drawings
FIG. 1 is a diagram of a multi-scale converged feature enhanced path aggregation network MSFE-PANet structure in the present invention;
FIG. 2 is a detailed structure diagram of MSFE-PANet in the present invention;
FIG. 3 is a predicted scale improvement architecture in accordance with the present invention;
FIG. 4 is a channel attention structure of a CBAM according to the present invention;
FIG. 5 is a spatial attention structure of a CBAM according to the present invention;
FIG. 6 is a comparison of results from different modules in the present invention;
FIG. 7 illustrates various embedding patterns of the attention module according to the present invention;
FIG. 8 is a detailed result image of MSFE-PANet in an embodiment of the present invention;
FIG. 9 is a visual result image of MSFE-PANet in an embodiment of the present invention;
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
The multi-scale fused feature-enhancement path aggregation network MSFE-PANet enhances the mutual fusion of the strong localization information of shallow feature maps and the strong semantic information of deep feature maps, helps the network find regions of interest in complex scenes, and improves sensitivity to small targets. The RB_Loss rejection loss function and the new network prediction scale are designed to address overlap, missed detection of occluded small targets, and false detection against complex backgrounds.
Referring to fig. 1, the method for detecting the small target in the complex scene based on the feature enhancement provided by the invention comprises the following steps:
step 1: data preparation. The method comprises the following steps: a large aerial dataset visclone 2021 was used, the image size of which was approximately 2000 x 1500, containing a variety of scenes from country to city, and containing various climate changes, light and shade changes, and shooting angle changes, etc., while including 10 categories of pedestrians, automobiles, bicycles, and tricycles, 6471 images in the dataset were used for training, 548 images were used for verification, and 1610 images were used for testing.
Step 2: data enhancement. Specifically: partial data images are randomly selected from the dataset, and the targets in these images (both partially visible and fully visible) are randomly occluded at 0.2, 0.4, 0.6 and 0.8 of the target size. The new occlusion data generated in this way are expanded into the dataset, strengthening the robustness of the model to occluded targets and improving the accuracy of its judgments on them.
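The occlusion step above can be sketched as follows. This is a minimal illustrative reading of Cutout-DA, assuming targets are given as (x1, y1, x2, y2) boxes and that one rectangular patch per target, with sides a randomly chosen fraction of the box size, is filled with a constant value; the function name and fill value are hypothetical, not from the patent.

```python
import random

def cutout_da(image, boxes, ratios=(0.2, 0.4, 0.6, 0.8), fill=0):
    """Randomly occlude a patch inside each target box (illustrative).

    `image` is an H x W grid (list of lists) of pixel values; `boxes`
    are (x1, y1, x2, y2) target boxes. The patch side is the box side
    times a ratio drawn from `ratios`, as in the step above.
    """
    for (x1, y1, x2, y2) in boxes:
        r = random.choice(ratios)
        pw = max(1, int((x2 - x1) * r))   # patch width, at least one pixel
        ph = max(1, int((y2 - y1) * r))   # patch height
        # random top-left corner of the patch, kept inside the box
        px = random.randint(x1, max(x1, x2 - pw))
        py = random.randint(y1, max(y1, y2 - ph))
        for y in range(py, min(py + ph, y2)):
            for x in range(px, min(px + pw, x2)):
                image[y][x] = fill
    return image
```

In practice this would run on image arrays and per-instance annotations; the grid-of-lists form is only to keep the sketch dependency-free.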
Step 3: algorithm design. Specifically: the multi-scale fused feature-enhancement path aggregation network MSFE-PANet is designed.
Step 3.1: the network prediction scale is improved, see fig. 3. The prediction head YOLO head3 in YOLOv4 aimed at detecting large targets is removed, while the corresponding 13 x 13 feature map is retained; at the same time, a prediction head YOLO head0, generated from the shallow high-resolution 104 x 104 feature map and aimed at detecting small-scale targets, is added to the prediction network, yielding a new network prediction-scale structure.
Step 3.2: referring to fig. 3, feature layer fusion. Specifically: on the new network prediction-scale structure, the feature map extracted by each feature network layer is upsampled by the corresponding factor and added and fused with the first-layer feature map, giving new feature maps that make the prediction network finer and improve small-target detection precision;
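The upsample-and-add fusion of step 3.2 can be sketched with nearest-neighbour upsampling; the function names and the (C, H, W) NumPy layout are assumptions for illustration, not the patent's implementation (which operates inside the network with learned layers).

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_with_first_layer(first_layer, deeper_feats):
    """Upsample each deeper map to the first layer's resolution and add.

    `first_layer` is the highest-resolution (C, H, W) map; each map in
    `deeper_feats` has the same channel count and a spatial size that
    divides (H, W), mirroring the corresponding-factor upsampling above.
    """
    fused = first_layer.copy()
    _, H, _ = first_layer.shape
    for feat in deeper_feats:
        factor = H // feat.shape[1]       # upsampling factor for this layer
        fused = fused + upsample_nearest(feat, factor)
    return fused
```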
step 3.3: the attention module is introduced. The method comprises the following steps: attention mechanisms are integrated in PANet.
Step 3.3.1: a CBAM attention module is added. CBAM can be integrated into most CNN frameworks and trained end to end. Given an intermediate feature map as input, CBAM sequentially infers attention maps along the two independent dimensions of channel and space, as shown in equation (1):
F' = M_c(F) ⊗ F,  F'' = M_s(F') ⊗ F'  (1)
Referring to fig. 4, the channel attention of the CBAM module is calculated as (2):
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg)) + W_1(W_0(F_max)))  (2)
wherein σ is the Sigmoid activation function, and the MLP weights W_0 and W_1 are shared between the two pooled branches.
Referring to fig. 5, the spatial attention module generates a spatial attention map from the spatial relationships of the features. The calculation formula of the spatial attention is (3):
M_s(F) = σ(f^{7x7}([AvgPool(F); MaxPool(F)]))  (3)
wherein σ is the Sigmoid activation function and f^{7x7} denotes a convolution with a 7 x 7 filter.
Step 3.3.2: the channel attention module of CBAM is improved. The invention uses 1*1 convolutions instead of the fully connected layers in the channel attention module, resulting in a lighter-weight convolutional attention module, LW-CBAM. The calculation formula is defined as (4):
M_c(F) = σ(Conv_1x1(AvgPool(F)) + Conv_1x1(MaxPool(F)))  (4)
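A minimal numerical sketch of the channel attention described above: on globally pooled vectors a 1*1 convolution acts as a plain matrix multiplication, which is how the LW-CBAM bottleneck is modelled here. The weight shapes (C/r x C and C x C/r) follow the usual bottleneck convention and are an assumption, as are the function names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lw_channel_attention(feat, w0, w1):
    """Channel attention with a shared two-layer bottleneck (sketch).

    `feat` is (C, H, W); `w0` (C/r, C) and `w1` (C, C/r) stand in for
    the 1*1 convolutions of LW-CBAM, applied to both pooled branches
    with shared weights as in the shared-MLP formulation.
    """
    avg = feat.mean(axis=(1, 2))                      # AvgPool over H, W -> (C,)
    mx = feat.max(axis=(1, 2))                        # MaxPool over H, W -> (C,)
    bottleneck = lambda v: w1 @ np.maximum(w0 @ v, 0.0)  # shared weights, ReLU
    scale = sigmoid(bottleneck(avg) + bottleneck(mx))    # one weight per channel
    return feat * scale[:, None, None]                # rescale each channel
```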
step 3.3.3: a SE-attention module is introduced. The method comprises the following steps: first, an input X is given, and the number of channels is C 1 Through F tr Is subjected to a series of convolution and pooling operations to obtain a channel number C 2 Is characterized by U; f (F) sq For the feature compression operation, feature compression is carried out along the space dimension, each two-dimensional feature channel is changed into a pixel, the pixel has a global receptive field, and the output dimension is matched with the input feature channel number; followed by F ex Excitation operation, based on correlation among characteristic channels, each characteristic channel generates a weight to represent importance degree of each characteristic channel, and then the importance degree is weighted to the previous characteristic through multiplication to finish the calibration of the important characteristic. Calculation formula (5): u (U) C Representing the C-th channel in the feature map; z is Z C Is the output of the compression operation. The sigma is a Sigmoid activation function; w (W) 1 ,W 2 All are all fully connected operation; delta is the ReLU activation function. Calculation formula (7) S C Is the C-th weight in step S.
S = F_ex(Z, W) = σ(g(Z, W)) = σ(W_2 δ(W_1 Z))  (6)
F_scale(U_C, S_C) = S_C · U_C  (7)
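The squeeze-excitation-rescale chain of formulas (6) and (7) can be traced numerically as follows; the (C, H, W) layout and the bottleneck weight shapes are assumptions, and g(Z, W) is realised as the ReLU bottleneck W_2 δ(W_1 Z) stated in formula (6).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(u, w1, w2):
    """Squeeze-and-Excitation recalibration (sketch).

    `u` is a (C, H, W) feature map. The squeeze averages each channel
    to one value Z_C; the excitation sigma(W2 ReLU(W1 Z)) produces the
    per-channel weights S_C; the rescale multiplies S_C back onto U_C.
    Weight shapes (C/r, C) and (C, C/r) are assumed.
    """
    z = u.mean(axis=(1, 2))                    # squeeze: global average pool
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))  # excitation, formula (6)
    return u * s[:, None, None]                # rescale S_C * U_C, formula (7)
```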
Step 3.3.4: referring to fig. 2, the SPP module is improved. Specifically: the pooling layers with kernel sizes 1 x 1, 5 x 5, 9 x 9 and 13 x 13 in the SPP are changed into a 1*1 convolution and 3*3 hole (dilated) convolutions, so that the improved SPP module does not change the size of the feature map. The output feature map size is calculated by formula (8):
O = floor((n + 2p - d(k - 1) - 1) / s) + 1  (8)
wherein n is the input size, k the kernel size, s the stride, p the padding and d the dilation rate.
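The output-size formula above can be checked numerically. The sketch below shows that a 1*1 convolution, and a 3*3 hole convolution whose padding equals its dilation rate, both preserve a 13 x 13 feature map, consistent with the statement that the improved SPP module does not change the feature-map size. The function name is illustrative.

```python
def conv_out_size(n, k, s=1, p=0, d=1):
    """Output size of a convolution along one spatial dimension:
    floor((n + 2p - d*(k - 1) - 1) / s) + 1."""
    return (n + 2 * p - d * (k - 1) - 1) // s + 1

# 1*1 convolution: no padding needed, size preserved.
# 3*3 hole convolution: choosing padding p = dilation d preserves size,
# since n + 2d - d*(3-1) - 1 + 1 = n.
```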
step 3.3.5: referring to fig. 2, an optimized SE-attention module is integrated, and an improved SPP module is added in the SE-attention to enhance the expression capability of feature information of a feature map input into the SE-attention, thereby achieving a better classification effect.
Referring to fig. 7, the invention embeds the LW-CBAM and SSE-Attention modules in two different regions of the network, the neck and the detection head, according to the new network prediction-scale structure, to enhance important channel and spatial features. Four embedding modes are verified experimentally, yielding the optimal MSFE-PANet network model and improving small-target detection performance.
Referring to fig. 8, after the optimal attention embedding, the detailed effect of the method of the invention on detecting overlapping, aggregated and occluded small targets against complex backgrounds is shown.
Step 4: in the model training scheme, a prediction-box rejection loss function RB_Loss is designed.
Specifically: the overlap degree IOU between the prior prediction boxes of two overlapping targets is taken as the value of the loss. The larger the overlap, the larger the loss; in the training phase, the back-propagation network is optimized along the gradient direction, separating the overlapping prior prediction boxes of the two targets. Combining this rejection loss function with the YOLOv4 model suits small-target detection in complex application scenes and effectively alleviates the mutual occlusion and overlap of targets in images. The loss is defined by formula (9), which expresses the matching between the prior prediction boxes of different targets.
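The rejection idea can be sketched as a plain IOU sum over pairs of prior prediction boxes. The exact form of the loss formula is not reproduced in the text, so the pairwise sum below is an illustrative assumption, as are the function names; a real implementation would compute it on tensors inside the training loop.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rb_loss(prior_boxes):
    """Illustrative rejection loss: sum of pairwise IOUs between the
    prior prediction boxes of different targets. The larger the mutual
    overlap, the larger the loss, so minimising it along the gradient
    direction pushes overlapping boxes apart."""
    total = 0.0
    for i in range(len(prior_boxes)):
        for j in range(i + 1, len(prior_boxes)):
            total += iou(prior_boxes[i], prior_boxes[j])
    return total
```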
step 5: the training model, the network was trained on the visclone 2021 dataset with 200epochs, the experiment set the input picture Size to 416 x 416, the first 100epoch set Batch Size to 4, and the second 100epoch set Batch Size to 8.
Referring to fig. 6, the method of the invention was validated on the VisDrone2021 dataset and compared with the detection performance of the baseline network YOLOv4. By gradually adding the corresponding modules, such as the Cutout-DA data enhancement method, the attention modules and RB_Loss, the effectiveness of the method for detecting small targets in complex scenes is verified.
Referring to fig. 9, compared with other methods, the method of the invention accurately detects small targets that would otherwise be missed or falsely detected, and adapts to small-target detection tasks in complex scenes.
Claims (1)
1. A method for detecting small targets in complex scenes based on feature enhancement, comprising the following steps:
Step 1, data preparation: the dataset is derived from an aerial image;
step 2, data enhancement: partial data images are randomly selected from the dataset, and the targets in these images (both partially visible and fully visible) are randomly occluded at 0.2, 0.4, 0.6 and 0.8 of the target size, generating new occlusion data that are expanded into the VisDrone2021 dataset;
step 3, designing a multi-scale fused characteristic enhanced path aggregation network MSFE-PANet;
step 3.1: improving network prediction scale
Remove the prediction head YOLO head3 in YOLOv4 that is aimed at detecting large targets, while retaining the corresponding 13 x 13 feature map; at the same time, add to the prediction network a prediction head YOLO head0, generated from the shallow high-resolution 104 x 104 feature map, aimed at detecting small-scale targets, producing a new network prediction-scale structure;
step 3.2: feature layer fusion
On the new network prediction-scale structure, upsample the feature map extracted by each feature network layer by the corresponding factor, and add and fuse each with the first-layer feature map to obtain new feature maps;
step 3.3: an attention module;
step 4: designing a prediction frame rejection Loss function RB_Loss;
step 5: training a model;
the step 3.3 specifically comprises
Step 3.3.1: adding a CBAM attention module, as shown in formula (1):
F' = M_c(F) ⊗ F,  F'' = M_s(F') ⊗ F'  (1)
the calculation formula of the channel attention is (2):
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg)) + W_1(W_0(F_max)))  (2)
wherein σ is a Sigmoid activation function, and the MLP weights W_0 and W_1 are shared;
the calculation formula of the spatial attention is (3):
M_s(F) = σ(f^{7x7}([AvgPool(F); MaxPool(F)]))  (3)
wherein σ is a Sigmoid activation function and f^{7x7} denotes a filter of size 7 x 7;
step 3.3.2: improving the channel attention module of the CBAM;
step 3.3.3: introducing an SE-attention module;
step 3.3.4: improving the SPP module;
step 3.3.5: optimizing the SE-attention module;
said step 3.3.2, the calculation formula is defined as (4)
Step 3.3.3, given an input X with C_1 channels, a feature map U with C_2 channels is obtained through the series of convolution and pooling operations F_tr; F_sq is the feature compression operation, which compresses features along the spatial dimensions so that each two-dimensional feature channel becomes a single value; this is followed by the excitation operation F_ex, whose output weights are then multiplied back onto the previous features;
the compression is given by calculation formula (5):
Z_C = F_sq(U_C) = (1/(H x W)) Σ_i Σ_j U_C(i, j)  (5)
wherein U_C represents the C-th channel in the feature map and Z_C is the output of the compression operation; in formula (6), σ is a Sigmoid activation function, W_1 and W_2 are fully connected operations, and δ is a ReLU activation function; in formula (7), S_C is the C-th weight in S;
S = F_ex(Z, W) = σ(g(Z, W)) = σ(W_2 δ(W_1 Z))  (6)
F_scale(U_C, S_C) = S_C · U_C  (7);
the step 3.3.4 is specifically that
The pooling layers of the kernels with sizes 1, 5, 9 and 13 in the SPP are changed into a 1*1 convolution and 3*3 cavity (dilated) convolutions; the improved SPP module does not change the size of the feature map, and the output feature map size is calculated by formula (8):
O = floor((n + 2p - d(k - 1) - 1) / s) + 1  (8)
wherein n is the input size, k the kernel size, s the stride, p the padding and d the dilation rate;
The step 3.3.5 is specifically
Adding an improved SPP module into the SE-attention to obtain an SSE-attention module;
the step 4 is specifically that
Taking the degree of overlap IOU between the prior prediction boxes of two overlapping targets as the loss value, the back-propagation network is optimized along the gradient direction, separating the overlapping prior prediction boxes of the two targets; the loss is defined by formula (9), which expresses the matching between the prior prediction boxes of different targets.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210780211.6A CN115063691B (en) | 2022-07-04 | 2022-07-04 | Feature enhancement-based small target detection method in complex scene |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115063691A CN115063691A (en) | 2022-09-16 |
CN115063691B (en) | 2024-04-12
Family
ID=83204087
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210780211.6A Active CN115063691B (en) | 2022-07-04 | 2022-07-04 | Feature enhancement-based small target detection method in complex scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115063691B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112733749A (en) * | 2021-01-14 | 2021-04-30 | 青岛科技大学 | Real-time pedestrian detection method integrating attention mechanism |
WO2021139069A1 (en) * | 2020-01-09 | 2021-07-15 | 南京信息工程大学 | General target detection method for adaptive attention guidance mechanism |
CN114565900A (en) * | 2022-01-18 | 2022-05-31 | 广州软件应用技术研究院 | Target detection method based on improved YOLOv5 and binocular stereo vision |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11256960B2 (en) * | 2020-04-15 | 2022-02-22 | Adobe Inc. | Panoptic segmentation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Bolte et al. | Towards corner case detection for autonomous driving | |
CN107527352B (en) | Remote sensing ship target contour segmentation and detection method based on deep learning FCN network | |
CN105488517B (en) | A kind of vehicle brand type identifier method based on deep learning | |
CN111768388B (en) | Product surface defect detection method and system based on positive sample reference | |
CN111915530B (en) | End-to-end-based haze concentration self-adaptive neural network image defogging method | |
CN109522855B (en) | Low-resolution pedestrian detection method and system combining ResNet and SENet and storage medium | |
CN110222604B (en) | Target identification method and device based on shared convolutional neural network | |
CN111460914A (en) | Pedestrian re-identification method based on global and local fine-grained features | |
CN112215074A (en) | Real-time target identification and detection tracking system and method based on unmanned aerial vehicle vision | |
Cai et al. | MHA-Net: Multipath Hybrid Attention Network for building footprint extraction from high-resolution remote sensing imagery | |
CN112801027A (en) | Vehicle target detection method based on event camera | |
CN114972989A (en) | Single remote sensing image height information estimation method based on deep learning algorithm | |
CN116229452B (en) | Point cloud three-dimensional target detection method based on improved multi-scale feature fusion | |
CN117037004A (en) | Unmanned aerial vehicle image detection method based on multi-scale feature fusion and context enhancement | |
Wang et al. | Prohibited items detection in baggage security based on improved YOLOv5 | |
Zhang et al. | SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection | |
CN115063691B (en) | Feature enhancement-based small target detection method in complex scene | |
Tang et al. | HIC-YOLOv5: Improved YOLOv5 For Small Object Detection | |
Zhang et al. | Drone video object detection using convolutional neural networks with time domain motion features | |
CN115797684A (en) | Infrared small target detection method and system based on context information | |
Xie et al. | Pedestrian detection and location algorithm based on deep learning | |
CN116912670A (en) | Deep sea fish identification method based on improved YOLO model | |
Li et al. | ECA-YOLOv5: Multi scale infrared salient target detection algorithm based on anchor free network | |
CN112668495B (en) | Full-time space convolution module-based violent video detection algorithm | |
CN117765378B (en) | Method and device for detecting forbidden articles in complex environment with multi-scale feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |