CN112508014A - Improved YOLOv3 target detection method based on attention mechanism - Google Patents


Info

Publication number
CN112508014A
CN112508014A (application CN202011396416.1A)
Authority
CN
China
Prior art keywords
attention mechanism
feature
channel
target detection
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011396416.1A
Other languages
Chinese (zh)
Inventor
李永胜
孙长银
陆科林
徐乐玏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority: CN202011396416.1A
Publication: CN112508014A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Abstract

The invention discloses an improved YOLOv3 target detection method based on an attention mechanism. An attention module, SKNet, is introduced into the backbone network Darknet-53, so that the convolution kernel size is adaptively adjusted according to the input and the network focuses on the region of interest; a spatial pyramid pooling module, SPP, is introduced at the top of the feature extraction network to enlarge its receptive field; and a channel attention module, SENet, is introduced into the feature fusion network to assign weights to the channels and fully extract their effective feature information. Experiments show that, compared with the original YOLOv3 model, the method detects small targets more effectively, accelerates training convergence and improves detection accuracy while the detection speed is not significantly affected.

Description

Improved YOLOv3 target detection method based on attention mechanism
Technical Field
The invention relates to an improved YOLOv3 target detection method based on an attention mechanism, and belongs to the technical field of target detection in image processing.
Background
Target detection is a foundation of image understanding and computer vision, and underpins more complex, higher-level visual tasks such as segmentation, scene understanding, target tracking, image description, event detection and activity recognition. It is widely applied across artificial intelligence and information technology, including security, human-computer interaction, automatic driving, robot vision, consumer electronics, content-based image retrieval, intelligent video surveillance and augmented reality.
Currently, target detection algorithms based on deep learning can be roughly divided into two main categories:
1. Two-stage algorithms: candidate regions are generated first and then classified by a CNN (the R-CNN series);
2. One-stage algorithms: the network is applied directly to the input image and outputs classes and corresponding locations in a single pass (the YOLO series).
Although the R-CNN series achieves higher accuracy, even its later development, Faster R-CNN, detects at only about 5 FPS, whereas the YOLO series greatly improves detection speed while maintaining reasonable accuracy, so that detection can run in real-time scenarios. The detection idea of YOLO differs from that of the R-CNN series: it treats target detection as a regression task. The YOLO network predicts target positions and probabilities directly from the complete image in a single forward pass, forming an end-to-end structure.
YOLOv3 is one of the most widely applied target detection methods. It improves on YOLO so that the network performs better on small-target detection and detection accuracy while the detection speed is not greatly affected, still meeting real-time requirements. However, YOLOv3 still has the following problems: target localization accuracy is not high; training converges slowly; and the error rate on small targets is high.
Disclosure of Invention
The invention aims to provide an improved YOLOv3 target detection method based on an attention mechanism which, to a certain extent, detects small targets effectively, accelerates training convergence and improves detection accuracy while the detection speed is not greatly affected.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses an improved YOLOv3 target detection method based on an attention mechanism, which comprises the following steps:
s1: preprocessing an original image and resizing it to 416 × 416 × 3 to obtain training samples;
s2: modifying the network structure of Darknet-53, and introducing an attention mechanism of the size of an adaptive convolution kernel in each residual layer Basic Block module;
s3: introducing a spatial pyramid pooling module SPP at the top of Darknet-53 to increase the receptive field of the feature extraction network;
s4: extracting image features by using the improved Darknet-53 network, and leading feature maps of three scales out from different depths of the network to the feature fusion branches;
s5: introducing a channel attention mechanism into the three feature fusion branches, assigning weights to the channels, and fully extracting the effective feature information of the channels;
s6: and finally, respectively predicting on the three branches to obtain a multi-scale target detection result.
As a further technical solution of the present invention, in the step S1, the preprocessing manner includes random rotation, horizontal inversion and normalization.
As a further technical solution of the present invention, in step S2, the method for introducing the attention mechanism of the adaptive convolution kernel size into the residual layer Basic Block module includes:
s21: inserting a Selective Kernel Networks (SKNet) module, whose convolution kernel size is adaptive, after the first 1 × 1 convolution layer;
s22: modifying the kernel size of the original second convolution layer from 3 × 3 to 1 × 1. The original 3 × 3 convolution is thus replaced by SKNet followed by a 1 × 1 convolution, so that the modified residual structure resembles a bottleneck block.
As a further technical solution of the present invention, in the step S3, the method for introducing the spatial pyramid pooling module SPP at the top of the Darknet-53 includes:
s31: 4 branches are led out from the output of the last basic convolution module of Darknet-53;
s32: the first, second and third branches pass through maximum pooling layers a1, a2 and a3 respectively, where a1 has kernel size 5 and stride 1, a2 has kernel size 9 and stride 1, and a3 has kernel size 13 and stride 1; the last branch retains the original output features;
s33: splicing the outputs of the 4 branches on the channel dimension to obtain a new feature map;
s34: finally, passing the newly obtained feature map through a convolution layer to restore the original number of channels, keeping the input and output feature-map dimensions of the SPP module equal.
The SPP module is designed to be plug-and-play, so keeping the dimensions unchanged is important: it ensures the SPP module can be inserted anywhere in the network without error.
As a further technical solution of the present invention, in the step S5, the method for introducing the attention mechanism into the feature fusion branch includes:
two branches of 8-time down-sampling and 16-time down-sampling are selected, and after the Upesample at the upper sampling layer and feature graphs output by the two branches are spliced and fused according to channel dimensions, a channel attention mechanism module Squeeze-and-Excitation Networks is inserted. The two branches of 8-time down-sampling and 16-time down-sampling correspond to feature maps with different sizes, and the feature maps with different sizes are fused, so that multi-scale information can be fully utilized, and the detection precision of the target object under different scales is improved. The multi-scale features are obtained by directly splicing according to channel dimensions, information of some channels may have redundancy, weights are distributed to the channels, effective information of the channels can be fully extracted, and redundant information is reduced
Compared with the prior art, the invention has the following beneficial effects. An attention mechanism with adaptive convolution kernel size and a spatial pyramid pooling module SPP are introduced into the feature extraction network, so the receptive field is adaptively adjusted according to the size of the detected target, the network focuses better on the region of interest, target localization accuracy is improved, and the error rate on small targets is reduced. A channel attention mechanism is introduced into the feature fusion branches, which concentrates on meaningful channel feature information in the input image and reduces the weight of redundant information. In addition, the invention accelerates training convergence and improves detection accuracy while the detection speed is not greatly affected. Experimental results show that, with only a slight increase in the number of model parameters, accuracy on the VOC and COCO data sets is clearly improved.
Drawings
FIG. 1 is a flow chart of the improved YOLOv3 target detection method based on attention mechanism of the present invention;
FIG. 2 is a diagram showing a comparison of the structure of residual modules of the present invention, wherein (A) is the original residual module and (B) is the residual module after the attention mechanism is introduced;
FIG. 3 is a diagram of a Selective Kernel Networks network architecture for use with the present invention;
FIG. 4 is a diagram of the spatial pyramid pooling module SPP of the present invention;
FIG. 5 is a diagram of the feature fusion branch network after the attention mechanism is introduced;
FIG. 6 is a channel attention module for use with the present invention.
Detailed Description
The technical solution of the present invention will be further described with reference to the following detailed description and accompanying drawings.
Example 1: the specific embodiment discloses an improved YOLOv3 target detection method based on an attention mechanism, as shown in fig. 1 to 6, comprising the following steps:
s1: preprocessing an original image, and normalizing the original image into 416 multiplied by 3 to obtain a training sample, wherein the preprocessing mode comprises random rotation (-30 degrees to 30 degrees), horizontal turnover (50 percent of probability) and standardization processing;
s2: the feature extraction network Darknet-53 consists of many residual modules (Basic Blocks) and uses strided convolutions for down-sampling; the network structure of Darknet-53 is modified and an attention mechanism with adaptive convolution kernel size is introduced into each residual-layer Basic Block module, so that the network automatically adjusts its receptive field according to the size of the detection target and focuses better on the region of interest;
s3: introducing a spatial pyramid pooling module SPP at the top of Darknet-53 to increase the receptive field of the feature extraction network;
s4: extracting image features by using the improved Darknet-53 network, and leading the feature maps at three different scales, namely 32-fold, 16-fold and 8-fold down-sampling, out to the feature fusion branches to detect targets of different sizes, so that the method performs well on targets at different scales;
s5: introducing a channel attention mechanism into the three feature fusion branches;
s6: finally, predicting on the three branches respectively, namely the position of each target and the confidence of its class, to obtain a multi-scale target detection result.
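The preprocessing of step S1 can be sketched as follows. This is an illustrative NumPy-only sketch, not the patent's implementation: nearest-neighbour resizing stands in for whatever interpolation is actually used, the rotation step is left to the caller, and the per-channel mean/std values are the common ImageNet statistics, assumed here since the patent does not specify them.

```python
import numpy as np

def preprocess(img, size=416, mean=(0.485, 0.456, 0.406),
               std=(0.229, 0.224, 0.225), flip=False):
    """Resize an H x W x 3 uint8 image to size x size x 3, optionally flip
    it horizontally, and standardize each channel (illustrative sketch)."""
    h, w, _ = img.shape
    rows = np.arange(size) * h // size          # nearest-neighbour row indices
    cols = np.arange(size) * w // size          # nearest-neighbour column indices
    out = img[rows][:, cols].astype(np.float32) / 255.0
    if flip:                                    # training-time horizontal flip
        out = out[:, ::-1]
    return (out - np.array(mean)) / np.array(std)

x = (np.random.rand(300, 500, 3) * 255).astype(np.uint8)
y = preprocess(x, flip=True)                    # y has shape (416, 416, 3)
```

In training, `flip` would be drawn with 50% probability per sample, matching the augmentation described in S1.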
In step S2, the method for introducing the attention mechanism into the residual layer Basic Block module is as follows:
s21: inserting a Selective Kernel Networks (SKNet) module, whose convolution kernel size is adaptive, after the first 1 × 1 convolution layer, thereby introducing the attention mechanism. The structure of SKNet is shown in figure 3: the input first passes through convolution layers with different kernel sizes; the two outputs are added element-wise and then globally average-pooled; the result passes through a fully connected layer; the obtained channel information is split by Softmax into two sub-vectors A and B, which are multiplied with the respective convolution outputs of the first step; finally, the two feature vectors are added element-wise to give the final output;
s22: modifying the kernel size of the original second convolution layer from 3 × 3 to 1 × 1, so that the modified residual layer resembles a BottleNeck module.
As shown in fig. 2, fig. 2(A) is the original residual module Basic Block, and fig. 2(B) is the residual module after SKNet is introduced, which resembles a BottleNeck block.
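The fuse-and-select arithmetic of SKNet described in S21 can be sketched as below. This is an illustrative NumPy sketch under assumptions: the two convolution branch outputs `u1`, `u2` are taken as precomputed, the dense weights `w_reduce`, `w_a`, `w_b` are random stand-ins, and biases and BatchNorm are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sk_fuse(u1, u2, w_reduce, w_a, w_b):
    """Selective-kernel fusion for two branch outputs of shape C x H x W:
    add the branches, globally average-pool, reduce with an FC + ReLU,
    produce per-branch channel logits, softmax across the two branches,
    and recombine the branches with the resulting channel weights."""
    c = u1.shape[0]
    s = (u1 + u2).reshape(c, -1).mean(axis=1)   # global average pooling -> (C,)
    z = np.maximum(w_reduce @ s, 0.0)           # FC + ReLU, reduced dim d
    logits = np.stack([w_a @ z, w_b @ z])       # (2, C) per-branch logits
    e = np.exp(logits - logits.max(axis=0))
    a, b = e / e.sum(axis=0)                    # softmax over the 2 branches
    return a[:, None, None] * u1 + b[:, None, None] * u2, (a, b)

c, d, h, w = 8, 4, 6, 6
u1 = rng.standard_normal((c, h, w))
u2 = rng.standard_normal((c, h, w))
out, (a, b) = sk_fuse(u1, u2, rng.standard_normal((d, c)),
                      rng.standard_normal((c, d)), rng.standard_normal((c, d)))
```

For each channel the two selection weights sum to 1, so the output is a per-channel convex combination of the two receptive-field branches, which is the "adaptive kernel size" behaviour the description refers to.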
In step S3, the method for introducing the spatial pyramid pooling module SPP at the top of the Darknet-53 is as follows:
s31: 4 branches are led out from the output of the last basic convolution module of Darknet-53;
s32: the first, second and third branches pass through maximum pooling layers a1, a2 and a3 respectively, where a1 has kernel size 5 and stride 1, a2 has kernel size 9 and stride 1, and a3 has kernel size 13 and stride 1; the last branch retains the original output features;
s33: splicing the outputs of the 4 branches on the channel dimension to obtain a new feature map;
s34: finally, the newly obtained feature map passes through a convolution layer to restore the original number of channels;
the SPP network structure is shown in fig. 4.
In step S5, the network structure after the attention mechanism is introduced into the feature fusion branches is shown in fig. 5: after the Upsample output and the feature map of the corresponding scale are concatenated (concat) and fused along the channel dimension, a channel attention module, Squeeze-and-Excitation Networks, is inserted, and the weight distribution over the channels is adjusted so that the channel information after feature fusion is more effective.
FIG. 6 shows the structure of SENet, which does not change the dimensions of the input feature vector. First, global average pooling is applied to the feature vector, reducing it to dimension 1 × 1 × C and yielding the channel information; this then passes through a fully connected layer and a ReLU activation; next, the channel weights are obtained through another fully connected layer and a Sigmoid function; finally, the weight assigned to each channel is multiplied with the input feature vector to obtain the output features after channel attention.
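The SENet pipeline just described can be sketched directly. This is an illustrative NumPy sketch: the weight matrices `w1`, `w2` are random stand-ins (with an assumed reduction ratio r = 4), and biases are omitted.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def se_block(x, w1, w2):
    """SENet for a C x H x W input: global average pooling (squeeze) ->
    FC + ReLU (C -> C/r) -> FC + Sigmoid (C/r -> C) -> rescale the input
    channels.  The output keeps the input dimensions, as described above."""
    c = x.shape[0]
    s = x.reshape(c, -1).mean(axis=1)   # squeeze: 1 x 1 x C channel descriptor
    z = np.maximum(w1 @ s, 0.0)         # excitation: first FC + ReLU
    wts = sigmoid(w2 @ z)               # channel weights, each in (0, 1)
    return wts[:, None, None] * x, wts  # scale: reweight the input channels

rng = np.random.default_rng(1)
c, r, h, wd = 16, 4, 8, 8
x = rng.standard_normal((c, h, wd))
out, weights = se_block(x,
                        rng.standard_normal((c // r, c)),
                        rng.standard_normal((c, c // r)))
```

In the fusion branch of S5, `x` would be the concatenated multi-scale feature map, so channels carrying redundant information receive weights near 0 and effective channels weights near 1.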
The above description is only one embodiment of the present invention, but the scope of the present invention is not limited thereto; any modification or substitution that a person skilled in the art can readily conceive within the technical scope disclosed herein shall fall within the scope of the present invention, which should therefore be defined by the protection scope of the claims.

Claims (5)

1. An improved YOLOv3 target detection method based on an attention mechanism, characterized by comprising the following steps:
s1: preprocessing an original image and resizing it to 416 × 416 × 3 to obtain training samples;
s2: modifying the network structure of Darknet-53, and introducing an attention mechanism of the size of an adaptive convolution kernel in each residual layer Basic Block module;
s3: introducing a spatial pyramid pooling module SPP at the top of Darknet-53 to increase the receptive field of the feature extraction network;
s4: extracting image features by using an improved Darknet-53 network, and leading out feature maps (feature maps) of three scales from different depths of the network to a feature fusion branch;
s5: a channel attention mechanism is introduced into the three characteristic fusion branches, weights are distributed to the channels, and effective characteristic information of the channels is fully extracted;
s6: and finally, respectively predicting on the three branches to obtain a multi-scale target detection result.
2. The improved YOLOv3 target detection method based on attention mechanism as claimed in claim 1, wherein the preprocessing comprises random rotation, horizontal inversion and normalization in step S1.
3. The improved YOLOv3 target detection method based on attention mechanism as claimed in claim 1, wherein in step S2, the method of introducing the attention mechanism of adaptive convolution kernel size in the residual layer Basic Block module is:
s21: inserting a Selective Kernel Networks (SKNet) module, whose convolution kernel size is adaptive, after the first 1 × 1 convolution layer;
s22: the convolution kernel size of the original second convolution layer is modified from 3 × 3 to 1 × 1.
4. The improved YOLOv3 target detection method based on attention mechanism as claimed in claim 1, wherein in step S3, the method of introducing the spatial pyramid pooling module SPP at the top of the Darknet-53 is:
s31: 4 branches are led out from the output of the last basic convolution module of Darknet-53;
s32: the first, second and third branches pass through maximum pooling layers a1, a2 and a3 respectively, where a1 has kernel size 5 and stride 1, a2 has kernel size 9 and stride 1, and a3 has kernel size 13 and stride 1; the last branch retains the original output features;
s33: splicing the outputs of the 4 branches on the channel dimension to obtain a new feature map;
s34: and finally, passing the newly obtained feature diagram through a convolution layer to obtain the channel number of the original feature.
5. The improved YOLOv3 target detection method based on attention mechanism as claimed in claim 1, wherein in step S5, the method for introducing the channel attention mechanism in the feature fusion branch is:
two branches of 8-time down-sampling and 16-time down-sampling are selected, and after the Upsample at the upper sampling layer and the feature maps output by the two branches are spliced and fused according to the channel dimension, a channel attention mechanism module Squeeze-and-Excitation Networks (SEnet) is inserted.
CN202011396416.1A 2020-12-04 2020-12-04 Improved YOLOv3 target detection method based on attention mechanism Pending CN112508014A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011396416.1A CN112508014A (en) 2020-12-04 2020-12-04 Improved YOLOv3 target detection method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011396416.1A CN112508014A (en) 2020-12-04 2020-12-04 Improved YOLOv3 target detection method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN112508014A true CN112508014A (en) 2021-03-16

Family

ID=74969561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011396416.1A Pending CN112508014A (en) 2020-12-04 2020-12-04 Improved YOLOv3 target detection method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN112508014A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200057935A1 (en) * 2017-03-23 2020-02-20 Peking University Shenzhen Graduate School Video action detection method based on convolutional neural network
CN111079584A (en) * 2019-12-03 2020-04-28 东华大学 Rapid vehicle detection method based on improved YOLOv3
CN111814621A (en) * 2020-06-29 2020-10-23 中国科学院合肥物质科学研究院 Multi-scale vehicle and pedestrian detection method and device based on attention mechanism


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ASHERGAGA: "[Paper notes] The SKNet network (adaptively adjusting the receptive-field size)", pages 1 - 8, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/80513438> *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990325A (en) * 2021-03-24 2021-06-18 南通大学 Light network construction method for embedded real-time visual target detection
CN113223044A (en) * 2021-04-21 2021-08-06 西北工业大学 Infrared video target detection method combining feature aggregation and attention mechanism
CN113111828A (en) * 2021-04-23 2021-07-13 中国科学院宁波材料技术与工程研究所 Three-dimensional defect detection method and system for bearing
CN113378672A (en) * 2021-05-31 2021-09-10 扬州大学 Multi-target detection method for defects of power transmission line based on improved YOLOv3
CN113393438A (en) * 2021-06-15 2021-09-14 哈尔滨理工大学 Resin lens defect detection method based on convolutional neural network
CN113393438B (en) * 2021-06-15 2022-09-16 哈尔滨理工大学 Resin lens defect detection method based on convolutional neural network
CN113902735A (en) * 2021-09-13 2022-01-07 云南春芯科技有限公司 Crop disease identification method and device, electronic equipment and storage medium
CN113837275A (en) * 2021-09-24 2021-12-24 南京邮电大学 Improved YOLOv3 target detection method based on expanded coordinate attention
CN113837275B (en) * 2021-09-24 2023-10-17 南京邮电大学 Improved YOLOv3 target detection method based on expanded coordinate attention
CN114724022A (en) * 2022-03-04 2022-07-08 大连海洋大学 Culture fish school detection method, system and medium fusing SKNet and YOLOv5
CN114724022B (en) * 2022-03-04 2024-05-10 大连海洋大学 Method, system and medium for detecting farmed fish shoal by fusing SKNet and YOLOv5

Similar Documents

Publication Publication Date Title
CN112508014A (en) Improved YOLOv3 target detection method based on attention mechanism
CN109344725B (en) Multi-pedestrian online tracking method based on space-time attention mechanism
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN111079646A (en) Method and system for positioning weak surveillance video time sequence action based on deep learning
CN113673510B (en) Target detection method combining feature point and anchor frame joint prediction and regression
CN108520203B (en) Multi-target feature extraction method based on fusion of self-adaptive multi-peripheral frame and cross pooling feature
CN112434599B (en) Pedestrian re-identification method based on random occlusion recovery of noise channel
CN112084911B (en) Human face feature point positioning method and system based on global attention
CN112434723B (en) Day/night image classification and object detection method based on attention network
CN111723660A (en) Detection method for long ground target detection network
CN111259837A (en) Pedestrian re-identification method and system based on part attention
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
CN113420827A (en) Semantic segmentation network training and image semantic segmentation method, device and equipment
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
Kadim et al. Deep-learning based single object tracker for night surveillance.
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN117079095A (en) Deep learning-based high-altitude parabolic detection method, system, medium and equipment
Rao et al. Roads detection of aerial image with FCN-CRF model
CN114120076B (en) Cross-view video gait recognition method based on gait motion estimation
CN113159071B (en) Cross-modal image-text association anomaly detection method
CN114782360A (en) Real-time tomato posture detection method based on DCT-YOLOv5 model
CN114140524A (en) Closed loop detection system and method for multi-scale feature fusion
Xia et al. Multi-RPN Fusion-Based Sparse PCA-CNN Approach to Object Detection and Recognition for Robot-Aided Visual System
Han Comparison on object detection algorithms: A taxonomy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination