CN117456167A - Target detection algorithm based on improved YOLOv8s - Google Patents

Target detection algorithm based on improved YOLOv8s

Info

Publication number
CN117456167A
CN117456167A (application CN202311436139.6A)
Authority
CN
China
Prior art keywords
target
convolution
frame
image
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311436139.6A
Other languages
Chinese (zh)
Inventor
邵叶秦
王梓腾
吕昌
张若为
杨国青
许长勇
冯林威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN202311436139.6A priority Critical patent/CN117456167A/en
Publication of CN117456167A publication Critical patent/CN117456167A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/54Extraction of image or video features relating to texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/771Feature selection, e.g. selecting representative features from a multi-dimensional feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection algorithm based on improved YOLOv8s. The model is retrained on the VOC2012 dataset, and the main body of the network comprises input, data preprocessing, Backbone, Neck and output. The method enhances gradient-flow information in the model and better extracts spatial feature information across different scales; it captures edge, texture and other features in the image while retaining the most salient features; it learns richer multi-scale, multi-level feature representations, enables the model to adaptively adjust feature weights at different positions, focuses more on the target region, reduces background interference, and improves accurate target localization; and it optimizes gradient flow during back-propagation, reduces gradient vanishing, accelerates convergence, makes fuller use of context information, and improves the model's ability to detect and recognize targets.

Description

Target detection algorithm based on improved YOLOv8s
Technical Field
The invention relates to the field of computer vision and target detection, in particular to a target detection algorithm based on improved YOLOv8 s.
Background
Object detection algorithms based on deep learning have made remarkable progress in the field of computer vision. Compared with conventional approaches, these models exhibit stronger performance and greater potential in target detection tasks.
The core of the object detection task is to locate and classify objects in an image. Deep-learning-based target detection algorithms are mainly divided into two types: two-stage and one-stage. A two-stage detector first generates candidate regions and then classifies targets with a convolutional neural network; representative methods include R-CNN, Mask R-CNN, SPP-Net and Fast R-CNN. A one-stage detector localizes and classifies directly with a convolutional neural network, without a candidate-region generation step; common one-stage detectors include the YOLO series, SSD, DSSD and FSSD.
YOLOv8 is the latest version of the YOLO series. After multiple iterations and optimizations it offers notable improvements in real-time performance and prediction accuracy. YOLOv8 includes five network models of different scales, YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l and YOLOv8x, which trade off speed against accuracy through different scaling factors.
Object detection algorithms are prone to false and missed detection when dealing with small-sized objects and dense objects, meaning that the algorithm may in these cases falsely mark non-existing objects or fail to detect existing objects correctly. Thus, the accuracy of the target detection algorithm still leaves room for improvement.
The VOC2012 dataset encompasses multiple target categories, but the number of samples between categories is not uniform, resulting in a category imbalance problem that may make the algorithm overly dependent on certain categories while ignoring others.
In complex scenarios, objects often experience occlusion or overlap, which presents a significant challenge to the object detection algorithm. This situation may cause errors in the algorithm detection of the target, thereby affecting the accuracy of the algorithm.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a target detection algorithm based on improved YOLOv8s. Compared with the original YOLOv8s detector, the improved algorithm achieves better performance on the public VOC2012 dataset.
The invention adopts the following technical scheme: a target detection algorithm based on improved YOLOv8s, comprising the steps of:
1) The input to the network is an image with a resolution of 640×640×3.
2) A data preprocessing section: using the VOC2012 dataset, four augmentations are mainly employed during training: Mosaic augmentation, MixUp augmentation, spatial perturbation (Random Perspective) and color perturbation (HSV Augment).
3) The preprocessed image is input to the Backbone part of the network, where a NewConv module performs downsampling in the shallow information-extraction stage of the network.
4) The image feature information extracted by the Backbone part is input to the Neck part of the network, which also uses NewConv and C2f modules for image downsampling and feature extraction to obtain higher-level semantic information. On the basis of the top-down Feature Pyramid Network (FPN) path, a bottom-up path is introduced to form a Path Aggregation Feature Pyramid Network (PAFPN) structure that fuses features of different levels, improving the multi-scale expression capability of the features.
5) The image is predicted using the improved PSA_Decoupled head in Anchor-Free mode, producing the position regression parameters and the class prediction probability of each target detection frame. First, a 3×3 DWConv is added at the beginning of both the classification branch and the regression branch of the decoupled head; operating independently on each input channel, it learns stronger feature expression with fewer parameters, better captures the spatial correlation of the input data, and yields more accurate predictions for the classification and regression tasks. Then, PSA attention is added to the regression branch so that the model learns importance weights for different feature regions and adjusts the output of the regression branch accordingly, paying more attention to the feature information that matters most for target localization.
6) Each predicted frame is associated with its corresponding real frame using the task-aligned matching policy (Task Aligned Assigner): the IoU between predicted and real frames is computed and a target class is assigned to each predicted frame, ensuring that the assigner's allocation between class and target frame is consistent, i.e., task alignment is maintained. Finally, non-maximum suppression (NMS) is applied to the predicted target detection frames, and frames whose confidence scores do not meet the requirement are removed to obtain the optimal target detection frames.
Further, the step 3) specifically includes:
(1) The NewConv module comprises two branches. The left branch performs a channel transformation through a 1×1 convolution and then downsamples the image using MaxPool2d; the right branch realizes downsampling through a 3×3 convolution with a stride of 2. The outputs of the two branches are fused by element-wise addition, and a 3×3 convolution with a stride of 1 performs further feature extraction and transformation, increasing the representation and nonlinear transformation capability of the network.
(2) The downsampled image data are input to a C2f module for feature extraction; the C2f modules in the Backbone are used in the ratio 3:6:6:3. Compared with the C3 structure of YOLOv5, the C2f module first applies a 1×1 convolution to the input to obtain a feature map, which is then divided into two branches along the channel dimension by a split function to build a list. The last element of the list is copied and used as input to 3 serially connected Bottleneck blocks, each of whose outputs is appended to the list; the first Bottleneck takes one of the branches as input, and the output of each Bottleneck serves as the input of the next. Finally, all branches are spliced by a Concat function and a 1×1 convolution adjusts the number of channels, promoting the interaction and flow of information among the channels of the feature map.
(3) The extracted image features are input into the improved spatial pyramid pooling layer SPPFC2FC. The SPPFC2FC structure embeds SPPF into the C2f module: after the C2f operation described above, 3 MaxPool2d operations with k=5 are connected in series, the output of each MaxPool2d serving as the input of the next. The 3 MaxPool2d branches and the shortcut branch at the same level are then spliced with a Concat function, and a 1×1 convolution integrates the spatial features and adjusts the number of channels. The model thus acquires multi-scale target information better, aggregates features over different receptive fields, improves robustness, and effectively avoids the requirement of a fixed input image size.
(4) The multi-scale feature information output by the spatial pyramid pooling layer is fed to the improved gated convolution structure (ECA_GatedConv) proposed by the invention, which selectively weights the input features so that the network focuses more on important features and suppresses irrelevant or redundant ones. First, a 3×3 convolution extracts features from the input data and expands the channel dimension to 2 times that of the input. The resulting feature data are divided into a left branch and a right branch along the channel dimension by a split operation. The left branch applies a sigmoid function to limit the features to the range 0-1, yielding a feature map that controls the weights; the right branch applies a further 3×3 convolution for feature extraction and then an Exponential Linear Unit (ELU) activation. The feature map of the right branch is multiplied pixel by pixel with the weight-control feature map of the left branch, so that a soft mask is learned from the data. Finally, an Efficient Channel Attention (ECA) mechanism is introduced: ECA generates channel weights by applying a one-dimensional convolution of kernel size k to the aggregated features obtained by global average pooling, where k is adaptively determined from the channel dimension C as
k = ψ(C) = |log₂(C)/γ + b/γ|_odd
in which |t|_odd denotes the odd number nearest to t; γ and b are set to 2 and 1 in the experiments.
Further, the step 5) specifically includes:
PSA attention first splits the input data into S groups along the channel dimension. Each group performs a group convolution with a convolution kernel of a different size (e.g., k = 3, 5, 7, 9) and a corresponding number of groups; this obtains receptive fields of different scales in a lightweight manner and extracts multi-scale information from the image. The channel weights within each group are then extracted by an SE Weight module, and finally the weights of the S groups are softmax-normalized and used to re-weight the feature maps, adjusting the contribution of information at different scales. Through this weighting, the model pays more attention to the feature regions most critical to target localization, improving detection performance and accuracy.
Further, the step 6) specifically includes:
selecting a group of prediction frames with the maximum t value as positive samples according to the classification and regression scores, and taking the rest of prediction frames as negative samples, wherein the formula is as follows:
t = s^α × u^β (3)
where α and β are weighting hyper-parameters, s is the predicted score for the labelled class, and u is the IoU between the prediction frame and the real frame; multiplying the two measures the degree of alignment, and t can simultaneously control the optimization of the classification score and the IoU to realize task alignment.
The classification branch uses VFL Loss (Varifocal Loss) as the loss function, formulated as:
VFL(p, q) = -q(q·log(p) + (1-q)·log(1-p)) when q > 0, and VFL(p, q) = -αp^γ·log(1-p) when q = 0
where p is the IoU-aware classification score (IACS) of the prediction and q is the target score. When the prediction frame is a positive sample, q is the IoU between the prediction frame and the real frame, and the algorithm uses the common BCE Loss (Binary Cross Entropy Loss) with an adaptive IoU weight added to highlight the positive sample; when the prediction frame is a negative sample, q = 0, and the algorithm uses the Focal Loss to address the imbalance between positive and negative samples.
The regression branch uses CIoU Loss and DFL Loss (Distribution Focal Loss) together as the loss function, where the CIoU Loss is formulated as:
L_CIoU = 1 - IoU + ρ²(b, b^gt)/c² + αv
where ρ²(b, b^gt) is the squared Euclidean distance between the center points of the model prediction frame and the real frame, c is the diagonal length of the smallest enclosing rectangle of the prediction frame and the real frame, α is a parameter that balances the proportions, and v measures the consistency of the aspect ratio between the anchor frame and the target frame. The formula of DFL Loss is as follows:
DFL(S_i, S_{i+1}) = -((y_{i+1} - y)·log(S_i) + (y - y_i)·log(S_{i+1})) (8)
where y is the target label value and y_i and y_{i+1} are the two integers nearest to y, with y_i ≤ y ≤ y_{i+1}. The DFL optimizes the probability distribution over the region adjacent to the target position y in a cross-entropy manner and computes the weights of the nearest left and right integer coordinates by linear interpolation, so that the network focuses more quickly on the distribution around the target position and its learning of that distribution is strengthened. Finally, the 3 loss functions are combined with certain weight proportions as the total loss function of the network.
By introducing the novel spatial pyramid pooling layer SPPFC2FC, the invention enhances gradient-flow information in the model and better extracts spatial feature information across different scales. The invention combines maximum pooling and convolution for shallow downsampling of the feature map, which captures edge, texture and other features in the image while retaining its most salient features. Depthwise Separable Convolution (DWConv) is used to increase the depth of the detection head so that richer multi-scale, multi-level feature representations can be learned; a Pyramid Split Attention (PSA) mechanism is introduced on the regression branch so that the model adaptively adjusts the feature weights at different positions, focuses more on the target region, reduces background interference, and improves accurate target localization. An improved gated convolution structure is introduced after the novel spatial pyramid pooling layer so that the network attends more to features carrying important information while suppressing irrelevant or redundant ones, thereby optimizing gradient flow during back-propagation, reducing gradient vanishing, accelerating convergence, making fuller use of context information, and improving the model's ability to detect and recognize targets.
The invention has the beneficial effects that:
(1) The novel spatial pyramid pooling layer enhances the gradient-flow information of the model, better extracts spatial feature information across different scales, and alleviates the tendency toward false and missed detections for small-size and dense targets.
(2) Features are extracted by combining the maximum pooling and convolution, so that features such as edges, textures and the like in the image can be captured, and the most obvious features in the image are reserved.
(3) The DWConv is used for increasing the depth of the detection head, and the PSA attention is introduced on the regression branch, so that the model adaptively adjusts the feature weights of different positions, focuses on the target area more, reduces the background interference, and improves the target accurate positioning capability.
(4) The improved gating convolution structure is introduced after the novel space pyramid pooling layer, so that the network is more focused on the characteristics with important information, irrelevant or redundant characteristics are restrained, gradient back propagation is optimized, gradient disappearance phenomenon is reduced, convergence speed is accelerated, the detection and recognition capability of a model to a target is improved, the problem of insufficient contextual information is solved, and a learnable dynamic characteristic selection mechanism is provided.
Drawings
FIG. 1 is a schematic diagram of a modified YOLOv8s structure.
Fig. 2 is a schematic diagram of the C2f structure.
Fig. 3 is a schematic structural diagram of SPPFC2FC.
Fig. 4 is a diagram of NewConv structure.
FIG. 5 is a schematic diagram of a comparison of a generic gated convolution structure and a modified gated convolution structure.
Fig. 6 is a schematic diagram of the ECA attention mechanism.
Fig. 7 is a schematic structural diagram of the improved PSA_Decoupled head.
Fig. 8 is a schematic diagram of the structure of the PSA attention module.
Fig. 9 is a schematic of the effect of the modified YOLOv8s on the VOC2012 public dataset.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, a target detection algorithm based on improved YOLOv8s, the main part of the network comprises input, data preprocessing, backbone, neck and output, and specifically comprises the following steps:
1) The input to the network is an image with a resolution of 640×640×3.
2) A data preprocessing section: using the VOC2012 dataset, four augmentations are mainly employed during training: Mosaic augmentation, MixUp augmentation, spatial perturbation (Random Perspective) and color perturbation (HSV Augment).
3) The preprocessed image is input to the Backbone part of the network, where a NewConv module (as shown in fig. 4) performs downsampling in the shallow information-extraction stage of the network.
(1) The NewConv module comprises two branches. The left branch performs a channel transformation through a 1×1 convolution and then downsamples the image using MaxPool2d; the right branch realizes downsampling through a 3×3 convolution with a stride of 2. The outputs of the two branches are fused by element-wise addition, and a 3×3 convolution with a stride of 1 performs further feature extraction and transformation, increasing the representation and nonlinear transformation capability of the network.
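For illustration, a minimal PyTorch sketch of such a two-branch downsampling block is given below; the normalization layers, activation functions and module names are assumptions rather than details specified here.

```python
import torch
import torch.nn as nn

class NewConv(nn.Module):
    """Two-branch downsampling block (sketch).

    Left branch:  1x1 conv (channel transform) followed by MaxPool2d (stride-2 downsample).
    Right branch: 3x3 conv with stride 2 (learned downsample).
    The branches are fused by element-wise addition, then a stride-1 3x3 conv
    performs further feature extraction. Assumes even spatial dimensions.
    """

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.left = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.right = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise addition of the two downsampled branches, then refinement.
        return self.fuse(self.left(x) + self.right(x))
```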
(2) The downsampled image data are input to a C2f module (as shown in fig. 2) for feature extraction; the C2f modules in the Backbone are used in the ratio 3:6:6:3. Compared with the C3 structure of YOLOv5, the C2f module first applies a 1×1 convolution to the input to obtain a feature map, which is then divided into two branches along the channel dimension by a split function to build a list. The last element of the list is copied and used as input to 3 serially connected Bottleneck blocks, each of whose outputs is appended to the list; the first Bottleneck takes one of the branches as input, and the output of each Bottleneck serves as the input of the next. Finally, all branches are spliced by a Concat function and a 1×1 convolution adjusts the number of channels, promoting the interaction and flow of information among the channels of the feature map.
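For reference, a simplified PyTorch sketch of this C2f computation flow (1×1 convolution, split, chained Bottleneck blocks, Concat, 1×1 channel adjustment) is shown below; the normalization and activation layers and the exact channel bookkeeping are assumptions, not a verbatim reproduction of the YOLOv8 implementation.

```python
import torch
import torch.nn as nn

def conv_bn_act(c1: int, c2: int, k: int = 1) -> nn.Module:
    """Convolution followed by BatchNorm and SiLU (helper, assumed layout)."""
    return nn.Sequential(
        nn.Conv2d(c1, c2, k, 1, k // 2, bias=False), nn.BatchNorm2d(c2), nn.SiLU()
    )

class Bottleneck(nn.Module):
    def __init__(self, c: int, shortcut: bool = True):
        super().__init__()
        self.cv1 = conv_bn_act(c, c, 3)
        self.cv2 = conv_bn_act(c, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """1x1 conv -> split into two halves -> n chained Bottlenecks on the last
    element -> Concat all intermediate outputs -> 1x1 conv (sketch)."""

    def __init__(self, c1: int, c2: int, n: int = 3, shortcut: bool = True):
        super().__init__()
        self.c = c2 // 2
        self.cv1 = conv_bn_act(c1, 2 * self.c, 1)
        self.m = nn.ModuleList(Bottleneck(self.c, shortcut) for _ in range(n))
        self.cv2 = conv_bn_act((2 + n) * self.c, c2, 1)

    def forward(self, x):
        y = list(self.cv1(x).split((self.c, self.c), dim=1))
        for m in self.m:
            y.append(m(y[-1]))  # each Bottleneck feeds on the previous output
        return self.cv2(torch.cat(y, dim=1))
```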
(3) The extracted image features are input into the improved spatial pyramid pooling layer SPPFC2FC (as in fig. 3). The SPPFC2FC structure embeds SPPF into the C2f module: after the C2f operation described above, 3 MaxPool2d operations with k=5 are connected in series, the output of each MaxPool2d serving as the input of the next. The 3 MaxPool2d branches and the shortcut branch at the same level are then spliced with a Concat function, and a 1×1 convolution integrates the spatial features and adjusts the number of channels. The model thus acquires multi-scale target information better, aggregates features over different receptive fields, improves robustness, and effectively avoids the requirement of a fixed input image size.
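A minimal sketch of the pooling part of this structure follows; the surrounding C2f operation is omitted here, and the channel counts, normalization and activation are assumptions.

```python
import torch
import torch.nn as nn

class SPPFBlock(nn.Module):
    """Pooling stage used inside the SPPFC2FC structure (sketch).

    Three serial MaxPool2d layers with k=5 (each pooling output feeds the next);
    their outputs are concatenated with the un-pooled shortcut branch, then a
    1x1 convolution integrates spatial features and adjusts the channel count.
    In the full SPPFC2FC module this stage is embedded after the C2f operation.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.cv = nn.Sequential(
            nn.Conv2d(4 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        # Concat shortcut and the three serial pooling outputs, then 1x1 conv.
        return self.cv(torch.cat([x, p1, p2, p3], dim=1))
```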
(4) The multi-scale feature information output by the spatial pyramid pooling layer is fed to the improved gated convolution structure (ECA_GatedConv) proposed by the invention, which selectively weights the input features so that the network focuses more on important features and suppresses irrelevant or redundant ones. First, a 3×3 convolution extracts features from the input data and expands the channel dimension to 2 times that of the input. The resulting feature data are divided into a left branch and a right branch along the channel dimension by a split operation. The left branch applies a sigmoid function to limit the features to the range 0-1, yielding a feature map that controls the weights; the right branch applies a further 3×3 convolution for feature extraction and then an Exponential Linear Unit (ELU) activation. The feature map of the right branch is multiplied pixel by pixel with the weight-control feature map of the left branch, so that a soft mask is learned from the data. Finally, an Efficient Channel Attention (ECA) mechanism (as in fig. 6) is introduced: ECA generates channel weights by applying a one-dimensional convolution of kernel size k to the aggregated features obtained by global average pooling, where k is adaptively determined from the channel dimension C as
k = ψ(C) = |log₂(C)/γ + b/γ|_odd
in which |t|_odd denotes the odd number nearest to t; γ and b are set to 2 and 1 in the experiments.
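A minimal PyTorch sketch of such an ECA-gated convolution is given below; the class names and layer hyper-parameters are assumptions, and only the data flow described above (channel expansion, split, sigmoid gate, 3×3 convolution with ELU, pixel-wise product, ECA re-weighting with adaptive kernel size k) is reproduced.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling followed by a 1-D
    convolution whose kernel size k is chosen adaptively from C (sketch)."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1  # |t|_odd: nearest odd number
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        w = x.mean(dim=(2, 3), keepdim=True)           # global average pooling: (B, C, 1, 1)
        w = self.conv(w.squeeze(-1).transpose(1, 2))   # 1-D conv across channels
        w = torch.sigmoid(w.transpose(1, 2).unsqueeze(-1))
        return x * w                                   # channel re-weighting

class ECAGatedConv(nn.Module):
    """Improved gated convolution (ECA_GatedConv) data flow (sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.expand = nn.Conv2d(channels, 2 * channels, 3, padding=1)  # 2x channel expansion
        self.feat = nn.Conv2d(channels, channels, 3, padding=1)        # right-branch 3x3 conv
        self.act = nn.ELU()
        self.eca = ECA(channels)

    def forward(self, x):
        gate, feat = self.expand(x).chunk(2, dim=1)  # split into left / right branches
        gate = torch.sigmoid(gate)                   # left branch: 0-1 gating map
        feat = self.act(self.feat(feat))             # right branch: 3x3 conv + ELU
        return self.eca(gate * feat)                 # soft mask, then ECA re-weighting
```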
4) The image feature information extracted by the Backbone part is input to the Neck part of the network, which also uses NewConv and C2f modules for image downsampling and feature extraction to obtain higher-level semantic information. On the basis of the top-down Feature Pyramid Network (FPN) path, a bottom-up path is introduced to form a Path Aggregation Feature Pyramid Network (PAFPN) structure that fuses features of different levels, improving the multi-scale expression capability of the features.
5) The image is predicted using the improved PSA_Decoupled head (fig. 7) in Anchor-Free mode, producing the position regression parameters and class prediction probabilities of the target detection frames. First, a 3×3 DWConv is added at the beginning of both the classification branch and the regression branch of the decoupled head; operating independently on each input channel, it learns stronger feature expression with fewer parameters, better captures the spatial correlation of the input data, and yields more accurate predictions for the classification and regression tasks. Then, PSA attention is added to the regression branch (fig. 8) so that the model learns importance weights for different feature regions and adjusts the output of the regression branch accordingly, paying more attention to the feature information that matters most for target localization.
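A sketch of one possible layout for such a decoupled head is shown below; the channel widths, the DFL-style regression output size (reg_max) and the nn.Identity placeholder standing in for the PSA module described in the next paragraph are all assumptions.

```python
import torch
import torch.nn as nn

def dwconv(c: int, k: int = 3) -> nn.Module:
    """Depthwise convolution (DWConv): one filter per input channel (groups=c)."""
    return nn.Sequential(
        nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False),
        nn.BatchNorm2d(c),
        nn.SiLU(),
    )

class PSADecoupledHead(nn.Module):
    """Anchor-free decoupled head with a 3x3 DWConv at the start of both the
    classification and regression branches, and attention on the regression
    branch (sketch)."""

    def __init__(self, channels: int, num_classes: int, reg_max: int = 16):
        super().__init__()
        self.cls_branch = nn.Sequential(
            dwconv(channels),
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, num_classes, 1),           # per-class scores
        )
        self.reg_branch = nn.Sequential(
            dwconv(channels),
            nn.Identity(),  # stand-in for the PSA attention module sketched below
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, 4 * reg_max, 1),           # DFL-style box distribution (assumed size)
        )

    def forward(self, x: torch.Tensor):
        return self.cls_branch(x), self.reg_branch(x)
```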
PSA attention first splits the input data into S groups along the channel dimension. Each group performs a group convolution with a convolution kernel of a different size (e.g., k = 3, 5, 7, 9) and a corresponding number of groups; this obtains receptive fields of different scales in a lightweight manner and extracts multi-scale information from the image. The channel weights within each group are then extracted by an SE Weight module, and finally the weights of the S groups are softmax-normalized and used to re-weight the feature maps, adjusting the contribution of information at different scales. Through this weighting, the model pays more attention to the feature regions most critical to target localization, improving detection performance and accuracy.
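The following sketch illustrates this PSA computation (channel split into S groups, multi-kernel group convolutions, per-group SE weights, softmax re-weighting); the group counts and the SE reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEWeight(nn.Module):
    """Squeeze-and-Excitation weight module used inside PSA (sketch)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.fc(x)  # (B, C, 1, 1) channel weights

class PSAttention(nn.Module):
    """Pyramid Split Attention (sketch): split channels into S groups, apply group
    convolutions with different kernel sizes, compute per-group SE weights,
    softmax-normalize across groups and re-weight each group."""

    def __init__(self, channels: int, kernels=(3, 5, 7, 9)):
        super().__init__()
        self.s = len(kernels)
        assert channels % self.s == 0
        c = channels // self.s
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=4) for k in kernels  # group count is illustrative
        )
        self.se = nn.ModuleList(SEWeight(c) for _ in kernels)

    def forward(self, x):
        feats = list(x.chunk(self.s, dim=1))                      # split into S groups by channel
        feats = [conv(f) for conv, f in zip(self.convs, feats)]   # multi-scale group convolutions
        weights = torch.stack([se(f) for se, f in zip(self.se, feats)], dim=1)
        weights = torch.softmax(weights, dim=1)                   # normalize across the S groups
        out = [f * weights[:, i] for i, f in enumerate(feats)]
        return torch.cat(out, dim=1)
```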
6) Each predicted frame is associated with its corresponding real frame using the task-aligned matching policy (Task Aligned Assigner): the IoU between predicted and real frames is computed and a target class is assigned to each predicted frame, ensuring that the assigner's allocation between class and target frame is consistent, i.e., task alignment is maintained;
selecting a group of prediction frames with the maximum t value as positive samples according to the classification and regression scores, and taking the rest of prediction frames as negative samples, wherein the formula is as follows:
t = s^α × u^β (3)
where α and β are weighting hyper-parameters, s is the predicted score for the labelled class, and u is the IoU between the prediction frame and the real frame; multiplying the two measures the degree of alignment, and t can simultaneously control the optimization of the classification score and the IoU to realize task alignment.
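As a small illustration of this alignment metric, the snippet below computes t = s^α × u^β for a few hypothetical predictions and selects the best-aligned one; the α and β values shown are illustrative, not values prescribed here.

```python
import torch

def task_alignment_metric(cls_scores: torch.Tensor,
                          ious: torch.Tensor,
                          alpha: float = 0.5,
                          beta: float = 6.0) -> torch.Tensor:
    """Alignment metric t = s^alpha * u^beta used by the Task Aligned Assigner.

    cls_scores: predicted score of each prediction frame for its labelled class (s).
    ious:       IoU between each prediction frame and its real frame (u).
    alpha/beta: weighting hyper-parameters (illustrative values).
    """
    return cls_scores.pow(alpha) * ious.pow(beta)

# Hypothetical example: predictions with the largest t become positive samples.
t = task_alignment_metric(torch.tensor([0.9, 0.2, 0.7]), torch.tensor([0.8, 0.9, 0.3]))
positives = torch.topk(t, k=1).indices  # the remaining predictions are negatives
```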
The classification branch uses VFL Loss (Varifocal Loss) as the loss function, formulated as:
VFL(p, q) = -q(q·log(p) + (1-q)·log(1-p)) when q > 0, and VFL(p, q) = -αp^γ·log(1-p) when q = 0
where p is the IoU-aware classification score (IACS) of the prediction and q is the target score. When the prediction frame is a positive sample, q is the IoU between the prediction frame and the real frame, and the algorithm uses the common BCE Loss (Binary Cross Entropy Loss) with an adaptive IoU weight added to highlight the positive sample; when the prediction frame is a negative sample, q = 0, and the algorithm uses the Focal Loss to address the imbalance between positive and negative samples.
The regression branch uses CIoU Loss and DFL Loss (Distribution Focal Loss) together as the loss function, where the CIoU Loss is formulated as:
L_CIoU = 1 - IoU + ρ²(b, b^gt)/c² + αv
where ρ²(b, b^gt) is the squared Euclidean distance between the center points of the model prediction frame and the real frame, c is the diagonal length of the smallest enclosing rectangle of the prediction frame and the real frame, α is a parameter that balances the proportions, and v measures the consistency of the aspect ratio between the anchor frame and the target frame. The formula of DFL Loss is as follows:
DFL(S_i, S_{i+1}) = -((y_{i+1} - y)·log(S_i) + (y - y_i)·log(S_{i+1})) (8)
where y is the target label value and y_i and y_{i+1} are the two integers nearest to y, with y_i ≤ y ≤ y_{i+1}. The DFL optimizes the probability distribution over the region adjacent to the target position y in a cross-entropy manner and computes the weights of the nearest left and right integer coordinates by linear interpolation, so that the network focuses more quickly on the distribution around the target position and its learning of that distribution is strengthened. Finally, the 3 loss functions are combined with certain weight proportions as the total loss function of the network.
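A compact sketch of the DFL term of equation (8) is given below; the tensor shapes and the reg_max convention are assumptions.

```python
import torch
import torch.nn.functional as F

def dfl_loss(pred_dist: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Distribution Focal Loss as in equation (8) (sketch).

    pred_dist: (N, reg_max+1) unnormalized logits over discrete positions.
    target:    (N,) continuous target coordinate y, with y_i <= y <= y_{i+1}.
    The loss is a cross entropy over the two nearest integer positions,
    weighted by the linear-interpolation coefficients (y_{i+1} - y) and (y - y_i).
    """
    y_left = target.floor().long()             # y_i
    y_right = y_left + 1                       # y_{i+1}
    w_left = y_right.float() - target          # (y_{i+1} - y)
    w_right = target - y_left.float()          # (y - y_i)
    log_probs = F.log_softmax(pred_dist, dim=-1)
    s_left = log_probs.gather(-1, y_left.unsqueeze(-1)).squeeze(-1)    # log S_i
    s_right = log_probs.gather(-1, y_right.unsqueeze(-1)).squeeze(-1)  # log S_{i+1}
    return -(w_left * s_left + w_right * s_right).mean()
```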
Finally, non-maximum suppression (NMS) is applied to the predicted target detection frames, and frames whose confidence scores do not meet the requirement are removed to obtain the optimal target detection frames.
The invention provides an improved YOLOv8s target detection method. Inspired by the SPPF structure in YOLOv5, the SPPCSPC structure in YOLOv7 and the C2f structure in YOLOv8, it combines the small computation and high speed of SPPF with the rich gradient-flow information of C2f, yielding the novel spatial pyramid pooling structure SPPFC2FC (as shown in fig. 3). To address the information loss and reduced positional accuracy of the downsampling layers in the YOLO series, the invention proposes the NewConv module, which extracts features with maximum pooling and convolution together; it can capture edge, texture and other features in the image while retaining its most salient features, and is applied to shallow downsampling and feature extraction of the network (as shown in fig. 4). To improve the performance of the detection decoupled head, DWConv is used to increase the depth of the detection head, and adaptive, lightweight and efficient PSA attention (as shown in fig. 8) is introduced on the regression branch so that the model adaptively adjusts the feature weights at different positions, focuses more on the target region, reduces background interference and improves accurate target localization. Referring to fig. 5, the invention proposes an improved gated convolution structure based on the ordinary gated convolution. Adding the improved gated convolution to the SPPFC2FC module makes the network attend more to features carrying important information while suppressing irrelevant or redundant ones, optimizing the back-propagation of gradients, reducing gradient vanishing, accelerating convergence, and improving the model's ability to detect and recognize targets.
The invention (1) uses a novel space pyramid pooling layer to enhance the gradient flow information of the model and better extract the space characteristic information among different sizes. (2) Features are extracted by adopting maximum pooling and convolution, so that features such as edges, textures and the like in the image can be captured, and the most obvious features in the image are reserved. (3) The DWConv is used for increasing the depth of the detection head, and the PSA attention is introduced on the regression branch, so that the model adaptively adjusts the feature weights of different positions, focuses on the target area more, reduces the background interference, and improves the target accurate positioning capability. (4) The improved gating convolution structure is introduced after the novel space pyramid pooling layer, so that the network is more focused on the characteristics with important information, the irrelevant or redundant characteristics are restrained, the counter propagation of the gradient is optimized, the gradient vanishing phenomenon is reduced, the convergence speed is accelerated, and the detection and recognition capability of the model on the target is improved.
Compared with the original YOLOv8s detector, the improved YOLOv8s of the present invention improves mAP@0.5 by 3.0% and mAP@0.95 by 3.6% on the VOC2012 dataset (see fig. 9).

Claims (4)

1. A target detection algorithm based on improved YOLOv8s, characterized in that the method comprises the following steps:
1) The input to the network is an image with a resolution of 640×640×3;
2) A data preprocessing section: using the VOC2012 dataset, and adopting four enhancement means of mosaic enhancement, mixed enhancement, spatial disturbance and color disturbance during training;
3) Inputting the preprocessed image into the Backbone part of the network, and performing downsampling on the image using a NewConv module in the shallow information-extraction part of the network;
4) Inputting the image feature information extracted by the Backbone part into the Neck part of the network, again using NewConv and C2f modules for image downsampling and feature extraction to obtain higher-level semantic information, and introducing a bottom-up path on the basis of the top-down FPN path to form a PAFPN structure that fuses features of different levels, improving the multi-scale expression capability of the features;
5) Predicting the image using the improved PSA_Decoupled head in Anchor-Free mode, obtaining the position regression parameters and class prediction probability of each target detection frame; first, a 3×3 DWConv is added at the beginning of both the classification branch and the regression branch of the decoupled head, operating independently on each input channel, learning stronger feature expression with fewer parameters, better capturing the spatial correlation of the input data, and producing more accurate predictions on the classification and regression tasks; then, PSA attention is added to the regression branch so that the model learns importance weights for different feature regions and adjusts the output of the regression branch according to the weights, paying more attention to the feature information that is more important for target localization;
6) Using the task-aligned matching strategy to associate each prediction frame with its corresponding real frame, calculating the IoU between prediction frames and real frames, and assigning a target class to each prediction frame, ensuring that the assigner's allocation between class and target frame is consistent, i.e., maintaining task alignment; finally, non-maximum suppression is applied to the predicted target detection frames, and frames whose confidence scores do not meet the requirement are removed to obtain the optimal target detection frame.
2. The improved YOLOv8 s-based object detection algorithm of claim 1, wherein: the step 3) specifically comprises the following steps:
(1) The NewConv module comprises two branches: the left branch performs a channel transformation through a 1×1 convolution and then downsamples the image using MaxPool2d; the right branch realizes downsampling through a 3×3 convolution with a stride of 2; the outputs of the two branches are fused by element-wise addition, and a 3×3 convolution with a stride of 1 performs further feature extraction and transformation, increasing the representation and nonlinear transformation capability of the network;
(2) Inputting the downsampled image data to a C2f module for feature extraction, wherein the C2f modules in the Backbone part are used in the ratio 3:6:6:3; compared with the C3 structure of YOLOv5, the C2f module first applies a 1×1 convolution to the input to obtain a feature map, which is then divided into two branches along the channel dimension by a split function to build a list; the last element of the list is copied and used as input to 3 serially connected Bottleneck blocks, each of whose outputs is appended to the list, wherein the first Bottleneck takes one of the branches as input and the output of each Bottleneck serves as the input of the next; finally, all branches are spliced by a Concat function and a 1×1 convolution adjusts the number of channels, promoting the interaction and flow of information among the channels of the feature map;
(3) Inputting the extracted image features into the improved spatial pyramid pooling layer SPPFC2FC; the SPPFC2FC structure embeds SPPF into the C2f module: after the C2f operation, 3 MaxPool2d operations with k=5 are connected in series, the output of each MaxPool2d serving as the input of the next; the 3 MaxPool2d branches and the shortcut branch at the same level are then spliced with a Concat function, and a 1×1 convolution integrates the spatial features and adjusts the number of channels; the model thus acquires multi-scale target information better, aggregates features over different receptive fields, improves robustness, and effectively avoids the requirement of a fixed input image size;
(4) The multi-scale feature information output by the spatial pyramid pooling layer is fed to an improved gated convolution structure, which selectively weights the input features so that the network focuses more on important features and suppresses irrelevant or redundant ones; first, a 3×3 convolution extracts features from the input data and expands the channel dimension to 2 times that of the input; the resulting feature data are divided into a left branch and a right branch along the channel dimension by a split operation; the left branch applies a sigmoid function to limit the features to the range 0-1, yielding a feature map that controls the weights; the right branch applies a further 3×3 convolution for feature extraction and then an ELU activation; the feature map of the right branch is multiplied pixel by pixel with the weight-control feature map of the left branch, so that a soft mask is learned from the data; finally, an ECA attention mechanism is introduced, in which ECA generates channel weights by applying a one-dimensional convolution of kernel size k to the aggregated features obtained by global average pooling, where k is adaptively determined from the channel dimension C as
k = ψ(C) = |log₂(C)/γ + b/γ|_odd
in which |t|_odd denotes the odd number nearest to t; γ and b are set to 2 and 1 in the experiments.
3. An improved YOLOv8s based object detection algorithm according to claim 2, wherein: the step 5) specifically comprises the following steps:
PSA attention first splits the input data into S groups along the channel dimension; each group performs a group convolution with a convolution kernel of a different size and a corresponding number of groups, obtaining receptive fields of different scales in a lightweight manner and extracting multi-scale information from the image; the channel weights within each group are extracted by an SE Weight module, and finally the weights of the S groups are softmax-normalized and used to re-weight the feature maps, adjusting the contribution of information at different scales; through this weighting, the model pays more attention to the feature regions most critical to target localization, improving detection performance and accuracy.
4. A modified YOLOv8 s-based object detection algorithm according to claim 3, wherein: the step 6) specifically includes:
selecting a group of prediction frames with the maximum t value as positive samples according to the classification and regression scores, and taking the rest of prediction frames as negative samples, wherein the formula is as follows:
t = s^α × u^β (3)
where α and β are weighting hyper-parameters, s is the predicted score for the labelled class, and u is the IoU between the prediction frame and the real frame; multiplying the two measures the degree of alignment, and t can simultaneously control the optimization of the classification score and the IoU to realize task alignment;
the classification branch uses VFL Loss as the loss function, formulated as:
VFL(p, q) = -q(q·log(p) + (1-q)·log(1-p)) when q > 0, and VFL(p, q) = -αp^γ·log(1-p) when q = 0
where p is the IACS of the prediction and q is the target score; when the prediction frame is a positive sample, q is the IoU between the prediction frame and the real frame, and the algorithm uses the common BCE Loss with an adaptive IoU weight added to highlight the positive sample; when the prediction frame is a negative sample, q = 0, and the algorithm uses the Focal Loss to address the imbalance between positive and negative samples;
the regression branch uses CIoU Loss and DFL Loss together as the loss function, where the CIoU Loss is formulated as:
L_CIoU = 1 - IoU + ρ²(b, b^gt)/c² + αv
where ρ²(b, b^gt) is the squared Euclidean distance between the center points of the model prediction frame and the real frame, c is the diagonal length of the smallest enclosing rectangle of the prediction frame and the real frame, α is a parameter that balances the proportions, and v measures the consistency of the aspect ratio between the anchor frame and the target frame; and the formula of DFL Loss is as follows:
DFL(S_i, S_{i+1}) = -((y_{i+1} - y)·log(S_i) + (y - y_i)·log(S_{i+1})) (8)
where y is the target label value and y_i and y_{i+1} are the two integers nearest to y, with y_i ≤ y ≤ y_{i+1}; the DFL optimizes the probability distribution over the region adjacent to the target position y in a cross-entropy manner and computes the weights of the nearest left and right integer coordinates by linear interpolation, so that the network focuses more quickly on the distribution around the target position and its learning of that distribution is strengthened; finally, the 3 loss functions are combined with certain weight proportions as the total loss function of the network.
CN202311436139.6A 2023-11-01 2023-11-01 Target detection algorithm based on improved YOLOv8s Pending CN117456167A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311436139.6A CN117456167A (en) 2023-11-01 2023-11-01 Target detection algorithm based on improved YOLOv8s

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311436139.6A CN117456167A (en) 2023-11-01 2023-11-01 Target detection algorithm based on improved YOLOv8s

Publications (1)

Publication Number Publication Date
CN117456167A true CN117456167A (en) 2024-01-26

Family

ID=89585029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311436139.6A Pending CN117456167A (en) 2023-11-01 2023-11-01 Target detection algorithm based on improved YOLOv8s

Country Status (1)

Country Link
CN (1) CN117456167A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118015555A (en) * 2024-04-10 2024-05-10 南京国电南自轨道交通工程有限公司 Knife switch state identification method based on visual detection and mask pattern direction vector
CN118262331A (en) * 2024-03-04 2024-06-28 浙江浙蕨科技有限公司 Discontinuous frame-based traffic sign board multi-target tracking deep learning method


Similar Documents

Publication Publication Date Title
CN112884064B (en) Target detection and identification method based on neural network
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN111950453B (en) Random shape text recognition method based on selective attention mechanism
CN112818862B (en) Face tampering detection method and system based on multi-source clues and mixed attention
CN109741318B (en) Real-time detection method of single-stage multi-scale specific target based on effective receptive field
CN112150821B (en) Lightweight vehicle detection model construction method, system and device
CN117456167A (en) Target detection algorithm based on improved YOLOv8s
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN111008639B (en) License plate character recognition method based on attention mechanism
CN113361645B (en) Target detection model construction method and system based on meta learning and knowledge memory
CN114419413A (en) Method for constructing sensing field self-adaptive transformer substation insulator defect detection neural network
WO2024032010A1 (en) Transfer learning strategy-based real-time few-shot object detection method
CN113762327B (en) Machine learning method, machine learning system and non-transitory computer readable medium
CN110827312A (en) Learning method based on cooperative visual attention neural network
CN111860587A (en) Method for detecting small target of picture
CN113743505A (en) Improved SSD target detection method based on self-attention and feature fusion
CN116342894A (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN114333062B (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
CN117079095A (en) Deep learning-based high-altitude parabolic detection method, system, medium and equipment
CN115171074A (en) Vehicle target identification method based on multi-scale yolo algorithm
CN113609904B (en) Single-target tracking algorithm based on dynamic global information modeling and twin network
CN111222534A (en) Single-shot multi-frame detector optimization method based on bidirectional feature fusion and more balanced L1 loss
CN117710965A (en) Small target detection method based on improved YOLOv5
CN116824333A (en) Nasopharyngeal carcinoma detecting system based on deep learning model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination