CN112508014A - Improved YOLOv3 target detection method based on attention mechanism - Google Patents


Info

Publication number
CN112508014A
CN112508014A (application CN202011396416.1A)
Authority
CN
China
Prior art keywords
attention mechanism
feature
channel
target detection
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011396416.1A
Other languages
Chinese (zh)
Inventor
李永胜
孙长银
陆科林
徐乐玏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority: CN202011396416.1A
Publication: CN112508014A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Abstract

The invention discloses an improved YOLOv3 target detection method based on an attention mechanism. An attention module, SKNet, is introduced into the backbone network Darknet-53, so that the convolution kernel size is adaptively adjusted according to the input and the network focuses on the region of interest; a spatial pyramid pooling module, SPP, is introduced at the top of the feature extraction network to enlarge its receptive field; and a channel attention module, SENet, is introduced into the feature fusion network to assign weights to the channels and fully extract their effective feature information. Experiments show that, compared with the original YOLOv3 model, the method detects small targets more effectively, accelerates training convergence and improves detection accuracy while the detection speed is not significantly affected.

Description

Improved YOLOv3 target detection method based on attention mechanism
Technical Field
The invention relates to an improved YOLOv3 target detection method based on an attention mechanism, and belongs to the technical field of target detection in image processing.
Background
Target detection is a foundation of image understanding and computer vision, and underpins more complex, higher-level visual tasks such as segmentation, scene understanding, target tracking, image description, event detection and activity recognition. It is widely applied across artificial intelligence and information technology, including security, human-computer interaction, automatic driving, robot vision, consumer electronics, content-based image retrieval, intelligent video surveillance and augmented reality.
Currently, target detection algorithms based on deep learning can be roughly divided into two main categories:
1. Two-stage algorithms: candidate regions are generated first and then classified by a CNN (the R-CNN series);
2. One-stage algorithms: the network is applied directly to the input image and outputs classes and corresponding locations in a single pass (the YOLO series).
Although the R-CNN series achieves higher accuracy, even its later development, Faster R-CNN, detects at only about 5 FPS, whereas the YOLO series greatly improves detection speed while maintaining reasonable accuracy, so that detection can run in real-time scenarios. The detection idea of YOLO differs from that of the R-CNN series: it treats target detection as a regression task. The YOLO network predicts target positions and probabilities directly from the complete image in a single forward pass, forming an end-to-end structure.
YOLOv3 is one of the most widely applied target detection methods. It improves on YOLO so that the network performs better on small-target detection and detection accuracy while the detection speed is not greatly affected, still meeting real-time requirements. However, YOLOv3 still has the following problems: target localization accuracy is not high; training converges slowly; and the error rate on small targets is high.
Disclosure of Invention
The invention aims to provide an improved YOLOv3 target detection method based on an attention mechanism which, to a certain extent, detects small targets effectively, accelerates training convergence and improves detection accuracy while the detection speed is not greatly affected.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses an improved YOLOv3 target detection method based on an attention mechanism, which comprises the following steps:
s1: preprocessing an original image and resizing it to 416 × 416 × 3 to obtain training samples;
s2: modifying the network structure of Darknet-53, and introducing an attention mechanism of the size of an adaptive convolution kernel in each residual layer Basic Block module;
s3: introducing a spatial pyramid pooling module SPP at the top of Darknet-53 to increase the receptive field of the feature extraction network;
s4: extracting image features by using the improved Darknet-53 network, and leading feature maps of three scales out from different depths of the network to the feature fusion branches;
s5: introducing a channel attention mechanism into the three feature fusion branches, assigning weights to the channels, and fully extracting the effective feature information of the channels;
s6: and finally, respectively predicting on the three branches to obtain a multi-scale target detection result.
As a further technical solution of the present invention, in the step S1, the preprocessing manner includes random rotation, horizontal inversion and normalization.
As a further technical solution of the present invention, in step S2, the method for introducing the attention mechanism of the adaptive convolution kernel size into the residual layer Basic Block module includes:
s21: inserting a Selective Kernel Networks (SKNet) module, whose convolution kernel size is adaptive, after the first 1 × 1 convolution layer;
s22: modifying the kernel size of the original second convolution layer from 3 × 3 to 1 × 1. The original 3 × 3 convolution is thus replaced by SKNet followed by a 1 × 1 convolution, so that the modified residual structure resembles a bottleneck block.
As a further technical solution of the present invention, in the step S3, the method for introducing the spatial pyramid pooling module SPP at the top of the Darknet-53 includes:
s31: 4 branches are led out from the output of the last basic convolution module of Darknet-53;
s32: the first, second and third branches pass through maximum pooling layers a1, a2 and a3 respectively, where a1 has kernel size 5 and stride 1, a2 has kernel size 9 and stride 1, and a3 has kernel size 13 and stride 1; the last branch retains the original output features;
s33: splicing the outputs of the 4 branches on the channel dimension to obtain a new feature map;
s34: finally, passing the newly obtained feature map through a convolution layer to restore the original number of channels, keeping the input and output feature-map dimensions of the SPP module equal.
The SPP module is designed to be plug-and-play, so keeping the dimensions unchanged is important: it ensures the SPP module can be inserted anywhere in the network without error.
As a further technical solution of the present invention, in the step S5, the method for introducing the attention mechanism into the feature fusion branch includes:
two branches of 8-time down-sampling and 16-time down-sampling are selected, and after the Upesample at the upper sampling layer and feature graphs output by the two branches are spliced and fused according to channel dimensions, a channel attention mechanism module Squeeze-and-Excitation Networks is inserted. The two branches of 8-time down-sampling and 16-time down-sampling correspond to feature maps with different sizes, and the feature maps with different sizes are fused, so that multi-scale information can be fully utilized, and the detection precision of the target object under different scales is improved. The multi-scale features are obtained by directly splicing according to channel dimensions, information of some channels may have redundancy, weights are distributed to the channels, effective information of the channels can be fully extracted, and redundant information is reduced
Compared with the prior art, the invention has the following beneficial effects. An attention mechanism with adaptive convolution kernel size and a spatial pyramid pooling module SPP are introduced into the feature extraction network, so the receptive field is adaptively adjusted according to the size of the detected target, the network focuses better on the region of interest, target localization accuracy is improved, and the error rate on small targets is reduced. A channel attention mechanism is introduced into the feature fusion branches, which concentrates on meaningful channel feature information in the input image and reduces the weight of redundant information. In addition, the invention accelerates training convergence and improves detection accuracy while the detection speed is not greatly affected. Experimental results show that, with only a slight increase in the number of model parameters, accuracy on the VOC and COCO data sets is clearly improved.
Drawings
FIG. 1 is a flow chart of the improved YOLOv3 target detection method based on attention mechanism of the present invention;
FIG. 2 is a diagram showing a comparison of the structure of residual modules of the present invention, wherein (A) is the original residual module and (B) is the residual module after the attention mechanism is introduced;
FIG. 3 is a diagram of a Selective Kernel Networks network architecture for use with the present invention;
FIG. 4 is a diagram of the spatial pyramid pooling module SPP of the present invention;
FIG. 5 is a diagram of the feature fusion branch network after the attention mechanism is introduced;
FIG. 6 is a channel attention module for use with the present invention.
Detailed Description
The technical solution of the present invention will be further described with reference to the following detailed description and accompanying drawings.
Example 1: the specific embodiment discloses an improved YOLOv3 target detection method based on an attention mechanism, as shown in fig. 1 to 6, comprising the following steps:
s1: preprocessing an original image, and normalizing the original image into 416 multiplied by 3 to obtain a training sample, wherein the preprocessing mode comprises random rotation (-30 degrees to 30 degrees), horizontal turnover (50 percent of probability) and standardization processing;
s2: the feature extraction network Darknet-53 consists of many residual modules (Basic Blocks) and uses strided convolutions for down-sampling; the network structure of Darknet-53 is modified and an attention mechanism with adaptive convolution kernel size is introduced into each residual-layer Basic Block module, so that the network automatically adjusts its receptive field according to the size of the detection target and focuses better on the region of interest;
s3: introducing a spatial pyramid pooling module SPP at the top of Darknet-53 to increase the receptive field of the feature extraction network;
s4: extracting image features by using the improved Darknet-53 network, and leading the feature maps at three different scales, namely 32-fold, 16-fold and 8-fold down-sampling, out to the feature fusion branches to detect targets of different sizes, so that the method performs well on targets at different scales;
s5: introducing a channel attention mechanism into the three feature fusion branches;
s6: finally, predicting on the three branches respectively, namely the position of each target and the confidence of its class, to obtain a multi-scale target detection result.
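The preprocessing of step S1 can be sketched as follows. This is an illustrative NumPy-only sketch, not the patent's implementation: nearest-neighbour resizing stands in for whatever interpolation is actually used, the rotation step is left to the caller, and the per-channel mean/std values are the common ImageNet statistics, assumed here since the patent does not specify them.

```python
import numpy as np

def preprocess(img, size=416, mean=(0.485, 0.456, 0.406),
               std=(0.229, 0.224, 0.225), flip=False):
    """Resize an H x W x 3 uint8 image to size x size x 3, optionally flip
    it horizontally, and standardize each channel (illustrative sketch)."""
    h, w, _ = img.shape
    rows = np.arange(size) * h // size          # nearest-neighbour row indices
    cols = np.arange(size) * w // size          # nearest-neighbour column indices
    out = img[rows][:, cols].astype(np.float32) / 255.0
    if flip:                                    # training-time horizontal flip
        out = out[:, ::-1]
    return (out - np.array(mean)) / np.array(std)

x = (np.random.rand(300, 500, 3) * 255).astype(np.uint8)
y = preprocess(x, flip=True)                    # y has shape (416, 416, 3)
```

In training, `flip` would be drawn with 50% probability per sample, matching the augmentation described in S1.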
In step S2, the method for introducing the attention mechanism into the residual layer Basic Block module is as follows:
s21: inserting a Selective Kernel Networks (SKNet) module, whose convolution kernel size is adaptive, after the first 1 × 1 convolution layer, thereby introducing the attention mechanism. The structure of SKNet is shown in figure 3: the input first passes through convolution layers with different kernel sizes; the two outputs are added element-wise and then globally average-pooled; the result passes through a fully connected layer; the obtained channel information is split by Softmax into two sub-vectors A and B, which are multiplied with the respective convolution outputs of the first step; finally, the two feature vectors are added element-wise to give the final output;
s22: modifying the kernel size of the original second convolution layer from 3 × 3 to 1 × 1, so that the modified residual layer resembles a BottleNeck module.
As shown in fig. 2, fig. 2(A) is the original residual module Basic Block, and fig. 2(B) is the residual module after SKNet is introduced, which resembles a BottleNeck block.
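The fuse-and-select arithmetic of SKNet described in S21 can be sketched as below. This is an illustrative NumPy sketch under assumptions: the two convolution branch outputs `u1`, `u2` are taken as precomputed, the dense weights `w_reduce`, `w_a`, `w_b` are random stand-ins, and biases and BatchNorm are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sk_fuse(u1, u2, w_reduce, w_a, w_b):
    """Selective-kernel fusion for two branch outputs of shape C x H x W:
    add the branches, globally average-pool, reduce with an FC + ReLU,
    produce per-branch channel logits, softmax across the two branches,
    and recombine the branches with the resulting channel weights."""
    c = u1.shape[0]
    s = (u1 + u2).reshape(c, -1).mean(axis=1)   # global average pooling -> (C,)
    z = np.maximum(w_reduce @ s, 0.0)           # FC + ReLU, reduced dim d
    logits = np.stack([w_a @ z, w_b @ z])       # (2, C) per-branch logits
    e = np.exp(logits - logits.max(axis=0))
    a, b = e / e.sum(axis=0)                    # softmax over the 2 branches
    return a[:, None, None] * u1 + b[:, None, None] * u2, (a, b)

c, d, h, w = 8, 4, 6, 6
u1 = rng.standard_normal((c, h, w))
u2 = rng.standard_normal((c, h, w))
out, (a, b) = sk_fuse(u1, u2, rng.standard_normal((d, c)),
                      rng.standard_normal((c, d)), rng.standard_normal((c, d)))
```

For each channel the two selection weights sum to 1, so the output is a per-channel convex combination of the two receptive-field branches, which is the "adaptive kernel size" behaviour the description refers to.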
In step S3, the method for introducing the spatial pyramid pooling module SPP at the top of the Darknet-53 is as follows:
s31: 4 branches are led out from the output of the last basic convolution module of Darknet-53;
s32: the first, second and third branches pass through maximum pooling layers a1, a2 and a3 respectively, where a1 has kernel size 5 and stride 1, a2 has kernel size 9 and stride 1, and a3 has kernel size 13 and stride 1; the last branch retains the original output features;
s33: splicing the outputs of the 4 branches on the channel dimension to obtain a new feature map;
s34: finally, the newly obtained feature map passes through a convolution layer to restore the original number of channels;
the SPP network structure is shown in fig. 4.
In step S5, the network structure after the attention mechanism is introduced into the feature fusion branches is shown in fig. 5: after the Upsample output and the feature map of the corresponding scale are concatenated (concat) and fused along the channel dimension, a channel attention module, Squeeze-and-Excitation Networks, is inserted, and the weight distribution over the channels is adjusted so that the channel information after feature fusion is more effective.
FIG. 6 shows the structure of SENet, which does not change the dimensions of the input feature vector. First, global average pooling is applied to the feature vector, reducing it to dimension 1 × 1 × C and yielding the channel information; this then passes through a fully connected layer and a ReLU activation; next, the channel weights are obtained through another fully connected layer and a Sigmoid function; finally, the weight assigned to each channel is multiplied with the input feature vector to obtain the output features after channel attention.
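The SENet pipeline just described can be sketched directly. This is an illustrative NumPy sketch: the weight matrices `w1`, `w2` are random stand-ins (with an assumed reduction ratio r = 4), and biases are omitted.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def se_block(x, w1, w2):
    """SENet for a C x H x W input: global average pooling (squeeze) ->
    FC + ReLU (C -> C/r) -> FC + Sigmoid (C/r -> C) -> rescale the input
    channels.  The output keeps the input dimensions, as described above."""
    c = x.shape[0]
    s = x.reshape(c, -1).mean(axis=1)   # squeeze: 1 x 1 x C channel descriptor
    z = np.maximum(w1 @ s, 0.0)         # excitation: first FC + ReLU
    wts = sigmoid(w2 @ z)               # channel weights, each in (0, 1)
    return wts[:, None, None] * x, wts  # scale: reweight the input channels

rng = np.random.default_rng(1)
c, r, h, wd = 16, 4, 8, 8
x = rng.standard_normal((c, h, wd))
out, weights = se_block(x,
                        rng.standard_normal((c // r, c)),
                        rng.standard_normal((c, c // r)))
```

In the fusion branch of S5, `x` would be the concatenated multi-scale feature map, so channels carrying redundant information receive weights near 0 and effective channels weights near 1.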
The above description is only one embodiment of the present invention, but the scope of the present invention is not limited thereto; any modification or substitution that a person skilled in the art can readily conceive within the technical scope disclosed herein shall fall within the scope of the present invention, which should therefore be defined by the protection scope of the claims.

Claims (5)

1. An improved YOLOv3 target detection method based on an attention mechanism, characterized by comprising the following steps:
s1: preprocessing an original image and resizing it to 416 × 416 × 3 to obtain training samples;
s2: modifying the network structure of Darknet-53, and introducing an attention mechanism of the size of an adaptive convolution kernel in each residual layer Basic Block module;
s3: introducing a spatial pyramid pooling module SPP at the top of Darknet-53 to increase the receptive field of the feature extraction network;
s4: extracting image features by using an improved Darknet-53 network, and leading out feature maps (feature maps) of three scales from different depths of the network to a feature fusion branch;
s5: a channel attention mechanism is introduced into the three characteristic fusion branches, weights are distributed to the channels, and effective characteristic information of the channels is fully extracted;
s6: and finally, respectively predicting on the three branches to obtain a multi-scale target detection result.
2. The improved YOLOv3 target detection method based on attention mechanism as claimed in claim 1, wherein the preprocessing comprises random rotation, horizontal inversion and normalization in step S1.
3. The improved YOLOv3 target detection method based on attention mechanism as claimed in claim 1, wherein in step S2, the method of introducing the attention mechanism of adaptive convolution kernel size in the residual layer Basic Block module is:
s21: inserting a Selective Kernel Networks (SKNet) module, whose convolution kernel size is adaptive, after the first 1 × 1 convolution layer;
s22: the convolution kernel size of the original second convolution layer is modified from 3 × 3 to 1 × 1.
4. The improved YOLOv3 target detection method based on attention mechanism as claimed in claim 1, wherein in step S3, the method of introducing the spatial pyramid pooling module SPP at the top of the Darknet-53 is:
s31: 4 branches are led out from the output of the last basic convolution module of Darknet-53;
s32: the first, second and third branches pass through maximum pooling layers a1, a2 and a3 respectively, where a1 has kernel size 5 and stride 1, a2 has kernel size 9 and stride 1, and a3 has kernel size 13 and stride 1; the last branch retains the original output features;
s33: splicing the outputs of the 4 branches on the channel dimension to obtain a new feature map;
s34: and finally, passing the newly obtained feature diagram through a convolution layer to obtain the channel number of the original feature.
5. The improved YOLOv3 target detection method based on attention mechanism as claimed in claim 1, wherein in step S5, the method for introducing the channel attention mechanism in the feature fusion branch is:
two branches of 8-time down-sampling and 16-time down-sampling are selected, and after the Upsample at the upper sampling layer and the feature maps output by the two branches are spliced and fused according to the channel dimension, a channel attention mechanism module Squeeze-and-Excitation Networks (SEnet) is inserted.
CN202011396416.1A 2020-12-04 2020-12-04 Improved YOLOv3 target detection method based on attention mechanism Pending CN112508014A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011396416.1A CN112508014A (en) 2020-12-04 2020-12-04 Improved YOLOv3 target detection method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011396416.1A CN112508014A (en) 2020-12-04 2020-12-04 Improved YOLOv3 target detection method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN112508014A true CN112508014A (en) 2021-03-16

Family

ID=74969561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011396416.1A Pending CN112508014A (en) 2020-12-04 2020-12-04 Improved YOLOv3 target detection method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN112508014A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200057935A1 (en) * 2017-03-23 2020-02-20 Peking University Shenzhen Graduate School Video action detection method based on convolutional neural network
CN111079584A (en) * 2019-12-03 2020-04-28 东华大学 Rapid vehicle detection method based on improved YOLOv3
CN111814621A (en) * 2020-06-29 2020-10-23 中国科学院合肥物质科学研究院 Multi-scale vehicle and pedestrian detection method and device based on attention mechanism


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ASHERGAGA: "[Paper notes] The SKNet network (adaptively adjusting the receptive-field size)", pages 1 - 8, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/80513438> *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990325A (en) * 2021-03-24 2021-06-18 南通大学 Light network construction method for embedded real-time visual target detection
CN113223044A (en) * 2021-04-21 2021-08-06 西北工业大学 Infrared video target detection method combining feature aggregation and attention mechanism
CN113111828A (en) * 2021-04-23 2021-07-13 中国科学院宁波材料技术与工程研究所 Three-dimensional defect detection method and system for bearing
CN113378672A (en) * 2021-05-31 2021-09-10 扬州大学 Multi-target detection method for defects of power transmission line based on improved YOLOv3
CN113393438A (en) * 2021-06-15 2021-09-14 哈尔滨理工大学 Resin lens defect detection method based on convolutional neural network
CN113393438B (en) * 2021-06-15 2022-09-16 哈尔滨理工大学 Resin lens defect detection method based on convolutional neural network
CN113902735A (en) * 2021-09-13 2022-01-07 云南春芯科技有限公司 Crop disease identification method and device, electronic equipment and storage medium
CN113837275A (en) * 2021-09-24 2021-12-24 南京邮电大学 Improved YOLOv3 target detection method based on expanded coordinate attention
CN113837275B (en) * 2021-09-24 2023-10-17 南京邮电大学 Improved YOLOv3 target detection method based on expanded coordinate attention
CN114724022A (en) * 2022-03-04 2022-07-08 大连海洋大学 Culture fish school detection method, system and medium fusing SKNet and YOLOv5
CN114724022B (en) * 2022-03-04 2024-05-10 大连海洋大学 Method, system and medium for detecting farmed fish shoal by fusing SKNet and YOLOv5

Similar Documents

Publication Publication Date Title
CN112508014A (en) Improved YOLOv3 target detection method based on attention mechanism
CN109344725B (en) Multi-pedestrian online tracking method based on space-time attention mechanism
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN111079646A (en) Method and system for positioning weak surveillance video time sequence action based on deep learning
CN113673510B (en) Target detection method combining feature point and anchor frame joint prediction and regression
CN108520203B (en) Multi-target feature extraction method based on fusion of self-adaptive multi-peripheral frame and cross pooling feature
CN112434599B (en) Pedestrian re-identification method based on random occlusion recovery of noise channel
CN112084911B (en) Human face feature point positioning method and system based on global attention
CN112434723B (en) Day/night image classification and object detection method based on attention network
CN111723660A (en) Detection method for long ground target detection network
CN111259837A (en) Pedestrian re-identification method and system based on part attention
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
CN113420827A (en) Semantic segmentation network training and image semantic segmentation method, device and equipment
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
Kadim et al. Deep-learning based single object tracker for night surveillance.
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN117079095A (en) Deep learning-based high-altitude parabolic detection method, system, medium and equipment
Rao et al. Roads detection of aerial image with FCN-CRF model
CN114120076B (en) Cross-view video gait recognition method based on gait motion estimation
CN113159071B (en) Cross-modal image-text association anomaly detection method
CN114782360A (en) Real-time tomato posture detection method based on DCT-YOLOv5 model
CN114140524A (en) Closed loop detection system and method for multi-scale feature fusion
Xia et al. Multi-RPN Fusion-Based Sparse PCA-CNN Approach to Object Detection and Recognition for Robot-Aided Visual System
Han Comparison on object detection algorithms: A taxonomy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination