CN117456167A - Target detection algorithm based on improved YOLOv8s - Google Patents

Target detection algorithm based on improved YOLOv8s

Info

Publication number
CN117456167A
CN117456167A (application CN202311436139.6A)
Authority
CN
China
Prior art keywords
target
convolution
frame
image
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311436139.6A
Other languages
Chinese (zh)
Inventor
邵叶秦
王梓腾
吕昌
张若为
杨国青
许长勇
冯林威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN202311436139.6A priority Critical patent/CN117456167A/en
Publication of CN117456167A publication Critical patent/CN117456167A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/54Extraction of image or video features relating to texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/771Feature selection, e.g. selecting representative features from a multi-dimensional feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection algorithm based on improved YOLOv8s. The model is retrained on the VOC2012 dataset, and the main body of the network comprises input, data preprocessing, Backbone, Neck and output. The method enhances gradient-flow information in the model and better extracts spatial feature information across different scales; it captures edge, texture and other features in the image while retaining the most salient features; it learns richer multi-scale, multi-level feature representations, enables the model to adaptively adjust feature weights at different positions, focuses more on the target region, reduces background interference, and improves accurate target localization; and it optimizes gradient flow during back-propagation, reduces gradient vanishing, accelerates convergence, makes fuller use of context information, and improves the model's ability to detect and recognize targets.

Description

Target detection algorithm based on improved YOLOv8s
Technical Field
The invention relates to the field of computer vision and target detection, in particular to a target detection algorithm based on improved YOLOv8 s.
Background
Object detection algorithms based on deep learning have made remarkable progress in the field of computer vision. Compared with conventional approaches, these models exhibit stronger performance and greater potential in target detection tasks.
The core of the object detection task is to locate and classify objects in an image. Deep-learning-based target detection algorithms are mainly divided into two types: two-stage and one-stage. A two-stage detector first generates candidate regions and then classifies targets with a convolutional neural network; representative methods include R-CNN, Mask R-CNN, SPP-Net and Fast R-CNN. A one-stage detector localizes and classifies directly with a convolutional neural network, without a candidate-region generation step; common one-stage detectors include the YOLO series, SSD, DSSD and FSSD.
YOLOv8 is the latest version of the YOLO series. After multiple iterations and optimizations it offers notable improvements in real-time performance and prediction accuracy. YOLOv8 includes five network models of different scales, YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l and YOLOv8x, which trade off speed against accuracy through different scaling factors.
Object detection algorithms are prone to false and missed detection when dealing with small-sized objects and dense objects, meaning that the algorithm may in these cases falsely mark non-existing objects or fail to detect existing objects correctly. Thus, the accuracy of the target detection algorithm still leaves room for improvement.
The VOC2012 dataset encompasses multiple target categories, but the number of samples between categories is not uniform, resulting in a category imbalance problem that may make the algorithm overly dependent on certain categories while ignoring others.
In complex scenarios, objects often experience occlusion or overlap, which presents a significant challenge to the object detection algorithm. This situation may cause errors in the algorithm detection of the target, thereby affecting the accuracy of the algorithm.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a target detection algorithm based on improved YOLOv8s. Compared with the original YOLOv8s detector, the improved algorithm achieves better performance on the public VOC2012 dataset.
The invention adopts the following technical scheme: a target detection algorithm based on improved YOLOv8s, comprising the steps of:
1) The input to the network is an image with a resolution of 640×640×3.
2) A data preprocessing section: using the VOC2012 dataset, four augmentations are mainly employed during training: Mosaic augmentation, MixUp augmentation, spatial perturbation (Random Perspective) and color perturbation (HSV Augment).
3) The preprocessed image is input to the Backbone part of the network, where a NewConv module performs downsampling in the shallow information-extraction stage of the network.
4) The image feature information extracted by the Backbone part is input to the Neck part of the network, which also uses NewConv and C2f modules for image downsampling and feature extraction to obtain higher-level semantic information. On the basis of the top-down Feature Pyramid Network (FPN) path, a bottom-up path is introduced to form a Path Aggregation Feature Pyramid Network (PAFPN) structure that fuses features of different levels, improving the multi-scale expression capability of the features.
5) The image is predicted using the improved PSA_Decoupled head in Anchor-Free mode, producing the position regression parameters and the class prediction probability of each target detection frame. First, a 3×3 DWConv is added at the beginning of both the classification branch and the regression branch of the decoupled head; operating independently on each input channel, it learns stronger feature expression with fewer parameters, better captures the spatial correlation of the input data, and yields more accurate predictions for the classification and regression tasks. Then, PSA attention is added to the regression branch so that the model learns importance weights for different feature regions and adjusts the output of the regression branch accordingly, paying more attention to the feature information that matters most for target localization.
6) Each predicted frame is associated with its corresponding real frame using the task-aligned matching policy (Task Aligned Assigner): the IoU between predicted and real frames is computed and a target class is assigned to each predicted frame, ensuring that the assigner's allocation between class and target frame is consistent, i.e., task alignment is maintained. Finally, non-maximum suppression (NMS) is applied to the predicted target detection frames, and frames whose confidence scores do not meet the requirement are removed to obtain the optimal target detection frames.
Further, the step 3) specifically includes:
(1) The NewConv module comprises two branches. The left branch performs a channel transformation through a 1×1 convolution and then downsamples the image using MaxPool2d; the right branch realizes downsampling through a 3×3 convolution with a stride of 2. The outputs of the two branches are fused by element-wise addition, and a 3×3 convolution with a stride of 1 performs further feature extraction and transformation, increasing the representation and nonlinear transformation capability of the network.
(2) The downsampled image data are input to a C2f module for feature extraction; the C2f modules in the Backbone are used in the ratio 3:6:6:3. Compared with the C3 structure of YOLOv5, the C2f module first applies a 1×1 convolution to the input to obtain a feature map, which is then divided into two branches along the channel dimension by a split function to build a list. The last element of the list is copied and used as input to 3 serially connected Bottleneck blocks, each of whose outputs is appended to the list; the first Bottleneck takes one of the branches as input, and the output of each Bottleneck serves as the input of the next. Finally, all branches are spliced by a Concat function and a 1×1 convolution adjusts the number of channels, promoting the interaction and flow of information among the channels of the feature map.
(3) The extracted image features are input into the improved spatial pyramid pooling layer SPPFC2FC. The SPPFC2FC structure embeds SPPF into the C2f module: after the C2f operation described above, 3 MaxPool2d operations with k=5 are connected in series, the output of each MaxPool2d serving as the input of the next. The 3 MaxPool2d branches and the shortcut branch at the same level are then spliced with a Concat function, and a 1×1 convolution integrates the spatial features and adjusts the number of channels. The model thus acquires multi-scale target information better, aggregates features over different receptive fields, improves robustness, and effectively avoids the requirement of a fixed input image size.
(4) The multi-scale feature information output by the spatial pyramid pooling layer is fed to the improved gated convolution structure (ECA_GatedConv) proposed by the invention, which selectively weights the input features so that the network focuses more on important features and suppresses irrelevant or redundant ones. First, a 3×3 convolution extracts features from the input data and expands the channel dimension to 2 times that of the input. The resulting feature data are divided into a left branch and a right branch along the channel dimension by a split operation. The left branch applies a sigmoid function to limit the features to the range 0-1, yielding a feature map that controls the weights; the right branch applies a further 3×3 convolution for feature extraction and then an Exponential Linear Unit (ELU) activation. The feature map of the right branch is multiplied pixel by pixel with the weight-control feature map of the left branch, so that a soft mask is learned from the data. Finally, an Efficient Channel Attention (ECA) mechanism is introduced: ECA generates channel weights by applying a one-dimensional convolution of kernel size k to the aggregated features obtained by global average pooling, where k is adaptively determined from the channel dimension C as
k = ψ(C) = |log₂(C)/γ + b/γ|_odd
in which |t|_odd denotes the odd number nearest to t; γ and b are set to 2 and 1 in the experiments.
Further, the step 5) specifically includes:
PSA attention first splits the input data into S groups along the channel dimension. Each group performs a group convolution with a convolution kernel of a different size (e.g., k = 3, 5, 7, 9) and a corresponding number of groups; this obtains receptive fields of different scales in a lightweight manner and extracts multi-scale information from the image. The channel weights within each group are then extracted by an SE Weight module, and finally the weights of the S groups are softmax-normalized and used to re-weight the feature maps, adjusting the contribution of information at different scales. Through this weighting, the model pays more attention to the feature regions most critical to target localization, improving detection performance and accuracy.
Further, the step 6) specifically includes:
selecting a group of prediction frames with the maximum t value as positive samples according to the classification and regression scores, and taking the rest of prediction frames as negative samples, wherein the formula is as follows:
t = s^α × u^β (3)
where α and β are weighting hyper-parameters, s is the predicted score for the labelled class, and u is the IoU between the prediction frame and the real frame; multiplying the two measures the degree of alignment, and t can simultaneously control the optimization of the classification score and the IoU to realize task alignment.
The classification branch uses VFL Loss (Varifocal Loss) as the loss function, formulated as:
VFL(p, q) = -q(q·log(p) + (1-q)·log(1-p)) when q > 0, and VFL(p, q) = -αp^γ·log(1-p) when q = 0
where p is the IoU-aware classification score (IACS) of the prediction and q is the target score. When the prediction frame is a positive sample, q is the IoU between the prediction frame and the real frame, and the algorithm uses the common BCE Loss (Binary Cross Entropy Loss) with an adaptive IoU weight added to highlight the positive sample; when the prediction frame is a negative sample, q = 0, and the algorithm uses the Focal Loss to address the imbalance between positive and negative samples.
The regression branch uses CIoU Loss and DFL Loss (Distribution Focal Loss) together as the loss function, where the CIoU Loss is formulated as:
L_CIoU = 1 - IoU + ρ²(b, b^gt)/c² + αv
where ρ²(b, b^gt) is the squared Euclidean distance between the center points of the model prediction frame and the real frame, c is the diagonal length of the smallest enclosing rectangle of the prediction frame and the real frame, α is a parameter that balances the proportions, and v measures the consistency of the aspect ratio between the anchor frame and the target frame. The formula of DFL Loss is as follows:
DFL(S_i, S_{i+1}) = -((y_{i+1} - y)·log(S_i) + (y - y_i)·log(S_{i+1})) (8)
where y is the target label value and y_i and y_{i+1} are the two integers nearest to y, with y_i ≤ y ≤ y_{i+1}. The DFL optimizes the probability distribution over the region adjacent to the target position y in a cross-entropy manner and computes the weights of the nearest left and right integer coordinates by linear interpolation, so that the network focuses more quickly on the distribution around the target position and its learning of that distribution is strengthened. Finally, the 3 loss functions are combined with certain weight proportions as the total loss function of the network.
By introducing the novel spatial pyramid pooling layer SPPFC2FC, the invention enhances gradient-flow information in the model and better extracts spatial feature information across different scales. The invention combines maximum pooling and convolution for shallow downsampling of the feature map, which captures edge, texture and other features in the image while retaining its most salient features. Depthwise Separable Convolution (DWConv) is used to increase the depth of the detection head so that richer multi-scale, multi-level feature representations can be learned; a Pyramid Split Attention (PSA) mechanism is introduced on the regression branch so that the model adaptively adjusts the feature weights at different positions, focuses more on the target region, reduces background interference, and improves accurate target localization. An improved gated convolution structure is introduced after the novel spatial pyramid pooling layer so that the network attends more to features carrying important information while suppressing irrelevant or redundant ones, thereby optimizing gradient flow during back-propagation, reducing gradient vanishing, accelerating convergence, making fuller use of context information, and improving the model's ability to detect and recognize targets.
The invention has the beneficial effects that:
(1) The novel spatial pyramid pooling layer enhances the gradient-flow information of the model, better extracts spatial feature information across different scales, and alleviates the tendency toward false and missed detections for small-size and dense targets.
(2) Features are extracted by combining the maximum pooling and convolution, so that features such as edges, textures and the like in the image can be captured, and the most obvious features in the image are reserved.
(3) The DWConv is used for increasing the depth of the detection head, and the PSA attention is introduced on the regression branch, so that the model adaptively adjusts the feature weights of different positions, focuses on the target area more, reduces the background interference, and improves the target accurate positioning capability.
(4) The improved gating convolution structure is introduced after the novel space pyramid pooling layer, so that the network is more focused on the characteristics with important information, irrelevant or redundant characteristics are restrained, gradient back propagation is optimized, gradient disappearance phenomenon is reduced, convergence speed is accelerated, the detection and recognition capability of a model to a target is improved, the problem of insufficient contextual information is solved, and a learnable dynamic characteristic selection mechanism is provided.
Drawings
FIG. 1 is a schematic diagram of a modified YOLOv8s structure.
Fig. 2 is a schematic diagram of the C2f structure.
Fig. 3 is a schematic structural diagram of SPPFC2FC.
Fig. 4 is a diagram of NewConv structure.
FIG. 5 is a schematic diagram of a comparison of a generic gated convolution structure and a modified gated convolution structure.
Fig. 6 is a schematic diagram of the ECA attention mechanism.
Fig. 7 is a schematic structural diagram of the improved PSA_Decoupled head.
Fig. 8 is a schematic diagram of the structure of the PSA attention module.
Fig. 9 is a schematic of the effect of the modified YOLOv8s on the VOC2012 public dataset.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, a target detection algorithm based on improved YOLOv8s, the main part of the network comprises input, data preprocessing, backbone, neck and output, and specifically comprises the following steps:
1) The input to the network is an image with a resolution of 640×640×3.
2) A data preprocessing section: using the VOC2012 dataset, four augmentations are mainly employed during training: Mosaic augmentation, MixUp augmentation, spatial perturbation (Random Perspective) and color perturbation (HSV Augment).
3) The preprocessed image is input to the Backbone part of the network, where a NewConv module (as shown in fig. 4) performs downsampling in the shallow information-extraction stage of the network.
(1) The NewConv module comprises two branches. The left branch performs a channel transformation through a 1×1 convolution and then downsamples the image using MaxPool2d; the right branch realizes downsampling through a 3×3 convolution with a stride of 2. The outputs of the two branches are fused by element-wise addition, and a 3×3 convolution with a stride of 1 performs further feature extraction and transformation, increasing the representation and nonlinear transformation capability of the network.
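For illustration, a minimal PyTorch sketch of such a two-branch downsampling block is given below; the normalization layers, activation functions and module names are assumptions rather than details specified here.

```python
import torch
import torch.nn as nn

class NewConv(nn.Module):
    """Two-branch downsampling block (sketch).

    Left branch:  1x1 conv (channel transform) followed by MaxPool2d (stride-2 downsample).
    Right branch: 3x3 conv with stride 2 (learned downsample).
    The branches are fused by element-wise addition, then a stride-1 3x3 conv
    performs further feature extraction. Assumes even spatial dimensions.
    """

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.left = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.right = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise addition of the two downsampled branches, then refinement.
        return self.fuse(self.left(x) + self.right(x))
```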
(2) The downsampled image data are input to a C2f module (as shown in fig. 2) for feature extraction; the C2f modules in the Backbone are used in the ratio 3:6:6:3. Compared with the C3 structure of YOLOv5, the C2f module first applies a 1×1 convolution to the input to obtain a feature map, which is then divided into two branches along the channel dimension by a split function to build a list. The last element of the list is copied and used as input to 3 serially connected Bottleneck blocks, each of whose outputs is appended to the list; the first Bottleneck takes one of the branches as input, and the output of each Bottleneck serves as the input of the next. Finally, all branches are spliced by a Concat function and a 1×1 convolution adjusts the number of channels, promoting the interaction and flow of information among the channels of the feature map.
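For reference, a simplified PyTorch sketch of this C2f computation flow (1×1 convolution, split, chained Bottleneck blocks, Concat, 1×1 channel adjustment) is shown below; the normalization and activation layers and the exact channel bookkeeping are assumptions, not a verbatim reproduction of the YOLOv8 implementation.

```python
import torch
import torch.nn as nn

def conv_bn_act(c1: int, c2: int, k: int = 1) -> nn.Module:
    """Convolution followed by BatchNorm and SiLU (helper, assumed layout)."""
    return nn.Sequential(
        nn.Conv2d(c1, c2, k, 1, k // 2, bias=False), nn.BatchNorm2d(c2), nn.SiLU()
    )

class Bottleneck(nn.Module):
    def __init__(self, c: int, shortcut: bool = True):
        super().__init__()
        self.cv1 = conv_bn_act(c, c, 3)
        self.cv2 = conv_bn_act(c, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """1x1 conv -> split into two halves -> n chained Bottlenecks on the last
    element -> Concat all intermediate outputs -> 1x1 conv (sketch)."""

    def __init__(self, c1: int, c2: int, n: int = 3, shortcut: bool = True):
        super().__init__()
        self.c = c2 // 2
        self.cv1 = conv_bn_act(c1, 2 * self.c, 1)
        self.m = nn.ModuleList(Bottleneck(self.c, shortcut) for _ in range(n))
        self.cv2 = conv_bn_act((2 + n) * self.c, c2, 1)

    def forward(self, x):
        y = list(self.cv1(x).split((self.c, self.c), dim=1))
        for m in self.m:
            y.append(m(y[-1]))  # each Bottleneck feeds on the previous output
        return self.cv2(torch.cat(y, dim=1))
```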
(3) The extracted image features are input into the improved spatial pyramid pooling layer SPPFC2FC (as in fig. 3). The SPPFC2FC structure embeds SPPF into the C2f module: after the C2f operation described above, 3 MaxPool2d operations with k=5 are connected in series, the output of each MaxPool2d serving as the input of the next. The 3 MaxPool2d branches and the shortcut branch at the same level are then spliced with a Concat function, and a 1×1 convolution integrates the spatial features and adjusts the number of channels. The model thus acquires multi-scale target information better, aggregates features over different receptive fields, improves robustness, and effectively avoids the requirement of a fixed input image size.
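A minimal sketch of the pooling part of this structure follows; the surrounding C2f operation is omitted here, and the channel counts, normalization and activation are assumptions.

```python
import torch
import torch.nn as nn

class SPPFBlock(nn.Module):
    """Pooling stage used inside the SPPFC2FC structure (sketch).

    Three serial MaxPool2d layers with k=5 (each pooling output feeds the next);
    their outputs are concatenated with the un-pooled shortcut branch, then a
    1x1 convolution integrates spatial features and adjusts the channel count.
    In the full SPPFC2FC module this stage is embedded after the C2f operation.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.cv = nn.Sequential(
            nn.Conv2d(4 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        # Concat shortcut and the three serial pooling outputs, then 1x1 conv.
        return self.cv(torch.cat([x, p1, p2, p3], dim=1))
```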
(4) The multi-scale feature information output by the spatial pyramid pooling layer is fed to the improved gated convolution structure (ECA_GatedConv) proposed by the invention, which selectively weights the input features so that the network focuses more on important features and suppresses irrelevant or redundant ones. First, a 3×3 convolution extracts features from the input data and expands the channel dimension to 2 times that of the input. The resulting feature data are divided into a left branch and a right branch along the channel dimension by a split operation. The left branch applies a sigmoid function to limit the features to the range 0-1, yielding a feature map that controls the weights; the right branch applies a further 3×3 convolution for feature extraction and then an Exponential Linear Unit (ELU) activation. The feature map of the right branch is multiplied pixel by pixel with the weight-control feature map of the left branch, so that a soft mask is learned from the data. Finally, an Efficient Channel Attention (ECA) mechanism (as in fig. 6) is introduced: ECA generates channel weights by applying a one-dimensional convolution of kernel size k to the aggregated features obtained by global average pooling, where k is adaptively determined from the channel dimension C as
k = ψ(C) = |log₂(C)/γ + b/γ|_odd
in which |t|_odd denotes the odd number nearest to t; γ and b are set to 2 and 1 in the experiments.
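A minimal PyTorch sketch of such an ECA-gated convolution is given below; the class names and layer hyper-parameters are assumptions, and only the data flow described above (channel expansion, split, sigmoid gate, 3×3 convolution with ELU, pixel-wise product, ECA re-weighting with adaptive kernel size k) is reproduced.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling followed by a 1-D
    convolution whose kernel size k is chosen adaptively from C (sketch)."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1  # |t|_odd: nearest odd number
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        w = x.mean(dim=(2, 3), keepdim=True)           # global average pooling: (B, C, 1, 1)
        w = self.conv(w.squeeze(-1).transpose(1, 2))   # 1-D conv across channels
        w = torch.sigmoid(w.transpose(1, 2).unsqueeze(-1))
        return x * w                                   # channel re-weighting

class ECAGatedConv(nn.Module):
    """Improved gated convolution (ECA_GatedConv) data flow (sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.expand = nn.Conv2d(channels, 2 * channels, 3, padding=1)  # 2x channel expansion
        self.feat = nn.Conv2d(channels, channels, 3, padding=1)        # right-branch 3x3 conv
        self.act = nn.ELU()
        self.eca = ECA(channels)

    def forward(self, x):
        gate, feat = self.expand(x).chunk(2, dim=1)  # split into left / right branches
        gate = torch.sigmoid(gate)                   # left branch: 0-1 gating map
        feat = self.act(self.feat(feat))             # right branch: 3x3 conv + ELU
        return self.eca(gate * feat)                 # soft mask, then ECA re-weighting
```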
4) The image feature information extracted by the Backbone part is input to the Neck part of the network, which also uses NewConv and C2f modules for image downsampling and feature extraction to obtain higher-level semantic information. On the basis of the top-down Feature Pyramid Network (FPN) path, a bottom-up path is introduced to form a Path Aggregation Feature Pyramid Network (PAFPN) structure that fuses features of different levels, improving the multi-scale expression capability of the features.
5) The image is predicted using the improved PSA_Decoupled head (fig. 7) in Anchor-Free mode, producing the position regression parameters and class prediction probabilities of the target detection frames. First, a 3×3 DWConv is added at the beginning of both the classification branch and the regression branch of the decoupled head; operating independently on each input channel, it learns stronger feature expression with fewer parameters, better captures the spatial correlation of the input data, and yields more accurate predictions for the classification and regression tasks. Then, PSA attention is added to the regression branch (fig. 8) so that the model learns importance weights for different feature regions and adjusts the output of the regression branch accordingly, paying more attention to the feature information that matters most for target localization.
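A sketch of one possible layout for such a decoupled head is shown below; the channel widths, the DFL-style regression output size (reg_max) and the nn.Identity placeholder standing in for the PSA module described in the next paragraph are all assumptions.

```python
import torch
import torch.nn as nn

def dwconv(c: int, k: int = 3) -> nn.Module:
    """Depthwise convolution (DWConv): one filter per input channel (groups=c)."""
    return nn.Sequential(
        nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False),
        nn.BatchNorm2d(c),
        nn.SiLU(),
    )

class PSADecoupledHead(nn.Module):
    """Anchor-free decoupled head with a 3x3 DWConv at the start of both the
    classification and regression branches, and attention on the regression
    branch (sketch)."""

    def __init__(self, channels: int, num_classes: int, reg_max: int = 16):
        super().__init__()
        self.cls_branch = nn.Sequential(
            dwconv(channels),
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, num_classes, 1),           # per-class scores
        )
        self.reg_branch = nn.Sequential(
            dwconv(channels),
            nn.Identity(),  # stand-in for the PSA attention module sketched below
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, 4 * reg_max, 1),           # DFL-style box distribution (assumed size)
        )

    def forward(self, x: torch.Tensor):
        return self.cls_branch(x), self.reg_branch(x)
```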
PSA attention first splits the input data into S groups along the channel dimension. Each group performs a group convolution with a convolution kernel of a different size (e.g., k = 3, 5, 7, 9) and a corresponding number of groups; this obtains receptive fields of different scales in a lightweight manner and extracts multi-scale information from the image. The channel weights within each group are then extracted by an SE Weight module, and finally the weights of the S groups are softmax-normalized and used to re-weight the feature maps, adjusting the contribution of information at different scales. Through this weighting, the model pays more attention to the feature regions most critical to target localization, improving detection performance and accuracy.
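The following sketch illustrates this PSA computation (channel split into S groups, multi-kernel group convolutions, per-group SE weights, softmax re-weighting); the group counts and the SE reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEWeight(nn.Module):
    """Squeeze-and-Excitation weight module used inside PSA (sketch)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.fc(x)  # (B, C, 1, 1) channel weights

class PSAttention(nn.Module):
    """Pyramid Split Attention (sketch): split channels into S groups, apply group
    convolutions with different kernel sizes, compute per-group SE weights,
    softmax-normalize across groups and re-weight each group."""

    def __init__(self, channels: int, kernels=(3, 5, 7, 9)):
        super().__init__()
        self.s = len(kernels)
        assert channels % self.s == 0
        c = channels // self.s
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=4) for k in kernels  # group count is illustrative
        )
        self.se = nn.ModuleList(SEWeight(c) for _ in kernels)

    def forward(self, x):
        feats = list(x.chunk(self.s, dim=1))                      # split into S groups by channel
        feats = [conv(f) for conv, f in zip(self.convs, feats)]   # multi-scale group convolutions
        weights = torch.stack([se(f) for se, f in zip(self.se, feats)], dim=1)
        weights = torch.softmax(weights, dim=1)                   # normalize across the S groups
        out = [f * weights[:, i] for i, f in enumerate(feats)]
        return torch.cat(out, dim=1)
```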
6) Each predicted frame is associated with its corresponding real frame using the task-aligned matching policy (Task Aligned Assigner): the IoU between predicted and real frames is computed and a target class is assigned to each predicted frame, ensuring that the assigner's allocation between class and target frame is consistent, i.e., task alignment is maintained;
selecting a group of prediction frames with the maximum t value as positive samples according to the classification and regression scores, and taking the rest of prediction frames as negative samples, wherein the formula is as follows:
t = s^α × u^β (3)
where α and β are weighting hyper-parameters, s is the predicted score for the labelled class, and u is the IoU between the prediction frame and the real frame; multiplying the two measures the degree of alignment, and t can simultaneously control the optimization of the classification score and the IoU to realize task alignment.
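As a small illustration of this alignment metric, the snippet below computes t = s^α × u^β for a few hypothetical predictions and selects the best-aligned one; the α and β values shown are illustrative, not values prescribed here.

```python
import torch

def task_alignment_metric(cls_scores: torch.Tensor,
                          ious: torch.Tensor,
                          alpha: float = 0.5,
                          beta: float = 6.0) -> torch.Tensor:
    """Alignment metric t = s^alpha * u^beta used by the Task Aligned Assigner.

    cls_scores: predicted score of each prediction frame for its labelled class (s).
    ious:       IoU between each prediction frame and its real frame (u).
    alpha/beta: weighting hyper-parameters (illustrative values).
    """
    return cls_scores.pow(alpha) * ious.pow(beta)

# Hypothetical example: predictions with the largest t become positive samples.
t = task_alignment_metric(torch.tensor([0.9, 0.2, 0.7]), torch.tensor([0.8, 0.9, 0.3]))
positives = torch.topk(t, k=1).indices  # the remaining predictions are negatives
```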
The classification branch uses VFL Loss (Varifocal Loss) as the loss function, formulated as:
VFL(p, q) = -q(q·log(p) + (1-q)·log(1-p)) when q > 0, and VFL(p, q) = -αp^γ·log(1-p) when q = 0
where p is the IoU-aware classification score (IACS) of the prediction and q is the target score. When the prediction frame is a positive sample, q is the IoU between the prediction frame and the real frame, and the algorithm uses the common BCE Loss (Binary Cross Entropy Loss) with an adaptive IoU weight added to highlight the positive sample; when the prediction frame is a negative sample, q = 0, and the algorithm uses the Focal Loss to address the imbalance between positive and negative samples.
The regression branch uses CIoU Loss and DFL Loss (Distribution Focal Loss) together as the loss function, where the CIoU Loss is formulated as:
L_CIoU = 1 - IoU + ρ²(b, b^gt)/c² + αv
where ρ²(b, b^gt) is the squared Euclidean distance between the center points of the model prediction frame and the real frame, c is the diagonal length of the smallest enclosing rectangle of the prediction frame and the real frame, α is a parameter that balances the proportions, and v measures the consistency of the aspect ratio between the anchor frame and the target frame. The formula of DFL Loss is as follows:
DFL(S_i, S_{i+1}) = -((y_{i+1} - y)·log(S_i) + (y - y_i)·log(S_{i+1})) (8)
where y is the target label value and y_i and y_{i+1} are the two integers nearest to y, with y_i ≤ y ≤ y_{i+1}. The DFL optimizes the probability distribution over the region adjacent to the target position y in a cross-entropy manner and computes the weights of the nearest left and right integer coordinates by linear interpolation, so that the network focuses more quickly on the distribution around the target position and its learning of that distribution is strengthened. Finally, the 3 loss functions are combined with certain weight proportions as the total loss function of the network.
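A compact sketch of the DFL term of equation (8) is given below; the tensor shapes and the reg_max convention are assumptions.

```python
import torch
import torch.nn.functional as F

def dfl_loss(pred_dist: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Distribution Focal Loss as in equation (8) (sketch).

    pred_dist: (N, reg_max+1) unnormalized logits over discrete positions.
    target:    (N,) continuous target coordinate y, with y_i <= y <= y_{i+1}.
    The loss is a cross entropy over the two nearest integer positions,
    weighted by the linear-interpolation coefficients (y_{i+1} - y) and (y - y_i).
    """
    y_left = target.floor().long()             # y_i
    y_right = y_left + 1                       # y_{i+1}
    w_left = y_right.float() - target          # (y_{i+1} - y)
    w_right = target - y_left.float()          # (y - y_i)
    log_probs = F.log_softmax(pred_dist, dim=-1)
    s_left = log_probs.gather(-1, y_left.unsqueeze(-1)).squeeze(-1)    # log S_i
    s_right = log_probs.gather(-1, y_right.unsqueeze(-1)).squeeze(-1)  # log S_{i+1}
    return -(w_left * s_left + w_right * s_right).mean()
```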
Finally, non-maximum suppression (NMS) is applied to the predicted target detection frames, and frames whose confidence scores do not meet the requirement are removed to obtain the optimal target detection frames.
The invention provides an improved YOLOv8s target detection method. Inspired by the SPPF structure in YOLOv5, the SPPCSPC structure in YOLOv7 and the C2f structure in YOLOv8, it combines the small computation and high speed of SPPF with the rich gradient-flow information of C2f, yielding the novel spatial pyramid pooling structure SPPFC2FC (as shown in fig. 3). To address the information loss and reduced positional accuracy of the downsampling layers in the YOLO series, the invention proposes the NewConv module, which extracts features with maximum pooling and convolution together; it can capture edge, texture and other features in the image while retaining its most salient features, and is applied to shallow downsampling and feature extraction of the network (as shown in fig. 4). To improve the performance of the detection decoupled head, DWConv is used to increase the depth of the detection head, and adaptive, lightweight and efficient PSA attention (as shown in fig. 8) is introduced on the regression branch so that the model adaptively adjusts the feature weights at different positions, focuses more on the target region, reduces background interference and improves accurate target localization. Referring to fig. 5, the invention proposes an improved gated convolution structure based on the ordinary gated convolution. Adding the improved gated convolution to the SPPFC2FC module makes the network attend more to features carrying important information while suppressing irrelevant or redundant ones, optimizing the back-propagation of gradients, reducing gradient vanishing, accelerating convergence, and improving the model's ability to detect and recognize targets.
The invention (1) uses a novel space pyramid pooling layer to enhance the gradient flow information of the model and better extract the space characteristic information among different sizes. (2) Features are extracted by adopting maximum pooling and convolution, so that features such as edges, textures and the like in the image can be captured, and the most obvious features in the image are reserved. (3) The DWConv is used for increasing the depth of the detection head, and the PSA attention is introduced on the regression branch, so that the model adaptively adjusts the feature weights of different positions, focuses on the target area more, reduces the background interference, and improves the target accurate positioning capability. (4) The improved gating convolution structure is introduced after the novel space pyramid pooling layer, so that the network is more focused on the characteristics with important information, the irrelevant or redundant characteristics are restrained, the counter propagation of the gradient is optimized, the gradient vanishing phenomenon is reduced, the convergence speed is accelerated, and the detection and recognition capability of the model on the target is improved.
Compared with the original YOLOv8s detector, the improved YOLOv8s of the present invention improves mAP@0.5 by 3.0% and mAP@0.95 by 3.6% on the VOC2012 dataset (see fig. 9).

Claims (4)

1. A target detection algorithm based on improved YOLOv8s, characterized in that the method comprises the following steps:
1) The input to the network is an image with a resolution of 640×640×3;
2) A data preprocessing section: using the VOC2012 dataset, and adopting four enhancement means of mosaic enhancement, mixed enhancement, spatial disturbance and color disturbance during training;
3) Inputting the preprocessed image into the Backbone part of the network, and performing downsampling on the image using a NewConv module in the shallow information-extraction part of the network;
4) Inputting the image feature information extracted by the Backbone part into the Neck part of the network, again using NewConv and C2f modules for image downsampling and feature extraction to obtain higher-level semantic information, and introducing a bottom-up path on the basis of the top-down FPN path to form a PAFPN structure that fuses features of different levels, improving the multi-scale expression capability of the features;
5) Predicting the image using the improved PSA_Decoupled head in Anchor-Free mode, obtaining the position regression parameters and class prediction probability of each target detection frame; first, a 3×3 DWConv is added at the beginning of both the classification branch and the regression branch of the decoupled head, operating independently on each input channel, learning stronger feature expression with fewer parameters, better capturing the spatial correlation of the input data, and producing more accurate predictions on the classification and regression tasks; then, PSA attention is added to the regression branch so that the model learns importance weights for different feature regions and adjusts the output of the regression branch according to the weights, paying more attention to the feature information that is more important for target localization;
6) Using the task-aligned matching strategy to associate each prediction frame with its corresponding real frame, calculating the IoU between prediction frames and real frames, and assigning a target class to each prediction frame, ensuring that the assigner's allocation between class and target frame is consistent, i.e., maintaining task alignment; finally, non-maximum suppression is applied to the predicted target detection frames, and frames whose confidence scores do not meet the requirement are removed to obtain the optimal target detection frame.
2. The improved YOLOv8 s-based object detection algorithm of claim 1, wherein: the step 3) specifically comprises the following steps:
(1) The NewConv module comprises two branches: the left branch performs a channel transformation through a 1×1 convolution and then downsamples the image using MaxPool2d; the right branch realizes downsampling through a 3×3 convolution with a stride of 2; the outputs of the two branches are fused by element-wise addition, and a 3×3 convolution with a stride of 1 performs further feature extraction and transformation, increasing the representation and nonlinear transformation capability of the network;
(2) Inputting the downsampled image data to a C2f module for feature extraction, wherein the C2f modules in the Backbone part are used in the ratio 3:6:6:3; compared with the C3 structure of YOLOv5, the C2f module first applies a 1×1 convolution to the input to obtain a feature map, which is then divided into two branches along the channel dimension by a split function to build a list; the last element of the list is copied and used as input to 3 serially connected Bottleneck blocks, each of whose outputs is appended to the list, wherein the first Bottleneck takes one of the branches as input and the output of each Bottleneck serves as the input of the next; finally, all branches are spliced by a Concat function and a 1×1 convolution adjusts the number of channels, promoting the interaction and flow of information among the channels of the feature map;
(3) Inputting the extracted image features into the improved spatial pyramid pooling layer SPPFC2FC; the SPPFC2FC structure embeds SPPF into the C2f module: after the C2f operation, 3 MaxPool2d operations with k=5 are connected in series, the output of each MaxPool2d serving as the input of the next; the 3 MaxPool2d branches and the shortcut branch at the same level are then spliced with a Concat function, and a 1×1 convolution integrates the spatial features and adjusts the number of channels; the model thus acquires multi-scale target information better, aggregates features over different receptive fields, improves robustness, and effectively avoids the requirement of a fixed input image size;
(4) The multi-scale feature information output by the spatial pyramid pooling layer is fed to an improved gated convolution structure, which selectively weights the input features so that the network focuses more on important features and suppresses irrelevant or redundant ones; first, a 3×3 convolution extracts features from the input data and expands the channel dimension to 2 times that of the input; the resulting feature data are divided into a left branch and a right branch along the channel dimension by a split operation; the left branch applies a sigmoid function to limit the features to the range 0-1, yielding a feature map that controls the weights; the right branch applies a further 3×3 convolution for feature extraction and then an ELU activation; the feature map of the right branch is multiplied pixel by pixel with the weight-control feature map of the left branch, so that a soft mask is learned from the data; finally, an ECA attention mechanism is introduced, in which ECA generates channel weights by applying a one-dimensional convolution of kernel size k to the aggregated features obtained by global average pooling, where k is adaptively determined from the channel dimension C as
k = ψ(C) = |log₂(C)/γ + b/γ|_odd
in which |t|_odd denotes the odd number nearest to t; γ and b are set to 2 and 1 in the experiments.
3. An improved YOLOv8s based object detection algorithm according to claim 2, wherein: the step 5) specifically comprises the following steps:
PSA attention first splits the input data into S groups along the channel dimension; each group performs a group convolution with a convolution kernel of a different size and a corresponding number of groups, obtaining receptive fields of different scales in a lightweight manner and extracting multi-scale information from the image; the channel weights within each group are extracted by an SE Weight module, and finally the weights of the S groups are softmax-normalized and used to re-weight the feature maps, adjusting the contribution of information at different scales; through this weighting, the model pays more attention to the feature regions most critical to target localization, improving detection performance and accuracy.
4. A modified YOLOv8 s-based object detection algorithm according to claim 3, wherein: the step 6) specifically includes:
selecting a group of prediction frames with the maximum t value as positive samples according to the classification and regression scores, and taking the rest of prediction frames as negative samples, wherein the formula is as follows:
t = s^α × u^β (3)
where α and β are weighting hyper-parameters, s is the predicted score for the labelled class, and u is the IoU between the prediction frame and the real frame; multiplying the two measures the degree of alignment, and t can simultaneously control the optimization of the classification score and the IoU to realize task alignment;
the classification branch uses VFL Loss as the loss function, formulated as:
VFL(p, q) = -q(q·log(p) + (1-q)·log(1-p)) when q > 0, and VFL(p, q) = -αp^γ·log(1-p) when q = 0
where p is the IACS of the prediction and q is the target score; when the prediction frame is a positive sample, q is the IoU between the prediction frame and the real frame, and the algorithm uses the common BCE Loss with an adaptive IoU weight added to highlight the positive sample; when the prediction frame is a negative sample, q = 0, and the algorithm uses the Focal Loss to address the imbalance between positive and negative samples;
the regression branch uses CIoU Loss and DFL Loss together as the loss function, where the CIoU Loss is formulated as:
L_CIoU = 1 - IoU + ρ²(b, b^gt)/c² + αv
where ρ²(b, b^gt) is the squared Euclidean distance between the center points of the model prediction frame and the real frame, c is the diagonal length of the smallest enclosing rectangle of the prediction frame and the real frame, α is a parameter that balances the proportions, and v measures the consistency of the aspect ratio between the anchor frame and the target frame; and the formula of DFL Loss is as follows:
DFL(S_i, S_{i+1}) = -((y_{i+1} - y)·log(S_i) + (y - y_i)·log(S_{i+1})) (8)
where y is the target label value and y_i and y_{i+1} are the two integers nearest to y, with y_i ≤ y ≤ y_{i+1}; the DFL optimizes the probability distribution over the region adjacent to the target position y in a cross-entropy manner and computes the weights of the nearest left and right integer coordinates by linear interpolation, so that the network focuses more quickly on the distribution around the target position and its learning of that distribution is strengthened; finally, the 3 loss functions are combined with certain weight proportions as the total loss function of the network.
CN202311436139.6A 2023-11-01 2023-11-01 Target detection algorithm based on improved YOLOv8s Pending CN117456167A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311436139.6A CN117456167A (en) 2023-11-01 2023-11-01 Target detection algorithm based on improved YOLOv8s

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311436139.6A CN117456167A (en) 2023-11-01 2023-11-01 Target detection algorithm based on improved YOLOv8s

Publications (1)

Publication Number Publication Date
CN117456167A true CN117456167A (en) 2024-01-26

Family

ID=89585029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311436139.6A Pending CN117456167A (en) 2023-11-01 2023-11-01 Target detection algorithm based on improved YOLOv8s

Country Status (1)

Country Link
CN (1) CN117456167A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118015555A (en) * 2024-04-10 2024-05-10 南京国电南自轨道交通工程有限公司 Knife switch state identification method based on visual detection and mask pattern direction vector
CN118262331A (en) * 2024-03-04 2024-06-28 浙江浙蕨科技有限公司 Discontinuous frame-based traffic sign board multi-target tracking deep learning method


Similar Documents

Publication Publication Date Title
CN112884064B (en) Target detection and identification method based on neural network
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN111950453B (en) Random shape text recognition method based on selective attention mechanism
CN112818862B (en) Face tampering detection method and system based on multi-source clues and mixed attention
CN109741318B (en) Real-time detection method of single-stage multi-scale specific target based on effective receptive field
CN112150821B (en) Lightweight vehicle detection model construction method, system and device
CN117456167A (en) Target detection algorithm based on improved YOLOv8s
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN111008639B (en) License plate character recognition method based on attention mechanism
CN113361645B (en) Target detection model construction method and system based on meta learning and knowledge memory
CN114419413A (en) Method for constructing sensing field self-adaptive transformer substation insulator defect detection neural network
WO2024032010A1 (en) Transfer learning strategy-based real-time few-shot object detection method
CN113762327B (en) Machine learning method, machine learning system and non-transitory computer readable medium
CN110827312A (en) Learning method based on cooperative visual attention neural network
CN111860587A (en) Method for detecting small target of picture
CN113743505A (en) Improved SSD target detection method based on self-attention and feature fusion
CN116342894A (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN114333062B (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
CN117079095A (en) Deep learning-based high-altitude parabolic detection method, system, medium and equipment
CN115171074A (en) Vehicle target identification method based on multi-scale yolo algorithm
CN113609904B (en) Single-target tracking algorithm based on dynamic global information modeling and twin network
CN111222534A (en) Single-shot multi-frame detector optimization method based on bidirectional feature fusion and more balanced L1 loss
CN117710965A (en) Small target detection method based on improved YOLOv5
CN116824333A (en) Nasopharyngeal carcinoma detecting system based on deep learning model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination