CN116630932A - Road shielding target detection method based on improved YOLOV5 - Google Patents
- Publication number
- CN116630932A (application number CN202310423047.8A)
- Authority
- CN
- China
- Prior art keywords
- training
- road
- model
- detection
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The application belongs to the field of road target detection, and particularly relates to a road shielding target detection method based on an improved YOLOV5. The method comprises the following steps: S1) constructing a road target data set and dividing it into a training set, a verification set and a test set; S2) processing the road target training set with the Mixup data enhancement method to enrich the number of road shielding target training samples; S3) constructing an improved YOLOV5 shielding target detection model; S4) inputting the training data from step S2 into the model of step S3 for training; S5) inputting the test set images into the model trained in step S4 for detection, and outputting the detection results, namely the bounding box position parameters and category information of the targets in the images. The method achieves higher target detection accuracy under dense road shielding conditions, and reduces the miss rate caused by densely arranged road targets.
Description
Technical Field
The application belongs to the field of road target detection, and particularly relates to a road shielding target detection method based on an improved YOLOV5.
Background
With the development of intelligent driving assistance and automatic driving technology, road shielding target detection is becoming increasingly important. Road shielding objects include trees, shrubs, road signs, pedestrians, vehicles and other obstacles that may obstruct the view of an autonomous car and increase driving risk. Therefore, developing an efficient and accurate road shielding target detection technology has become an important research direction in the field of automatic driving. Currently, the technical difficulties of road shielding target detection are mainly embodied in the following aspects:
1. Deformation caused by shielding: when a target is shielded by other objects, part of its area may be hidden or deformed, changing the apparent shape and size of the target, which greatly affects the accuracy of target detection;
2. Complexity of the background: when a target is occluded by other objects, the background around it may become more complex, which strongly influences detection accuracy; the background therefore needs to be modeled and processed during detection;
3. Mutual shielding of objects: in real scenes, a target may not only be blocked by other objects, but multiple targets may also shield one another, which severely degrades detection accuracy;
4. Quality of the dataset: the performance of occlusion target detection depends largely on the quality of the training dataset. Because objects in occlusion detection datasets are typically partially hidden, annotating the dataset is difficult, which makes training and testing of the model very hard.
Existing occlusion target detection techniques mainly comprise traditional detection algorithms and deep-learning-based detection algorithms. Traditional road shielding target detection is mainly based on object detection techniques from computer vision, such as sliding windows, hand-crafted feature extractors (HOG, SIFT, SURF) and classifiers such as SVM. However, these techniques require manual feature engineering and suffer from over-fitting, poor generalization ability and the like. The development of deep learning has brought new progress to road shielding target detection; in particular, object detection algorithms based on convolutional neural networks (CNN) have been widely applied to road target detection. YOLOV5, as a recent target detection algorithm, is both efficient and accurate. However, YOLOV5 was originally designed for target detection in general scenes, and its performance on complex road shielding scenes is still lacking.
In conclusion, the road shielding target detection technology has important significance and value in the fields of automatic driving, traffic management, safety monitoring and the like. The road condition can be better known and mastered by the automobile and the driver, and the road safety and the traffic efficiency are improved.
Disclosure of Invention
In order to overcome the shortcomings of existing target detectors in road shielding scenes, the application provides a road shielding target detection algorithm based on YOLOV5. Effective detection of road occlusion targets is achieved by improving the YOLOV5 target detection algorithm.
In order to achieve the above purpose, the present application adopts the following technical scheme: a road shielding target detection method based on improved YOLOV5, comprising the following steps in sequence:
s1, constructing a road target data set, and dividing the data set into a training set, a verification set and a test set;
s2, processing a road target training set by a Mixup data enhancement method, and enriching the number of road shielding target training samples;
s3, constructing an improved YOLOV5 shielding target detection model;
s4, inputting training data in the step S2 into the model of the step S3 for training;
s5, inputting the images of the test set into the model trained in the step S4 for detection, and outputting detection results, namely the boundary frame position parameters and the target category information of the targets in the images.
Further, step S1 specifically includes the following steps:
S1.1: collecting road target pictures, for example from a driving recorder or a mobile camera, and constructing a data set for model training;
S1.2: labeling the pictures from step S1.1 with the LabelImg software in the Pascal VOC format, i.e. label files with the xml suffix, and dividing the data set into a training set, a validation set and a test set at a ratio of 8:1:1.
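The 8:1:1 split of step S1.2 can be sketched as follows; the file names and random seed are illustrative assumptions, not part of the original method:

```python
import random

def split_dataset(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle and split a list of annotated samples into train/val/test
    subsets at the given ratio (8:1:1 by default, as in step S1.2)."""
    total = sum(ratios)
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# hypothetical label files produced by LabelImg
train, val, test = split_dataset([f"img_{i:04d}.xml" for i in range(1000)])
```

With 1000 labeled pictures this yields 800 training, 100 validation and 100 test samples.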
Further, the step S2 specifically includes:
S2.1: converting the training data set into RGB images and resizing them to the uniform input size required by the model, e.g. 640×640;
S2.2: performing data enhancement on the training data of step S2.1 with the Mixup data enhancement method.
Further, the step S3 specifically includes the following steps:
S3.1: optimizing the backbone network of YOLOV5, specifically using the deformable convolution DCNv2 (Deformable Convolutional Networks v2) to replace the ordinary convolution in the CSPLayer blocks of YOLOV5;
S3.2: adding a receptive field enhancement module RFEM (Receptive Field Enhancement Module), built from dilated convolutions, between the three effective feature layers of the optimized YOLOV5, thereby constructing an optimized path aggregation network PANet;
S3.3: optimizing the prediction post-processing stage of YOLOV5 with the EIOU metric and the Soft-NMS (flexible non-maximum suppression) algorithm.
Further, the step S4 specifically includes the following steps:
S4.1: configuring the training environment of the model;
S4.2: setting the model training parameters, specifically:
The optimizer for model training is set to SGD, the initial learning rate to 0.001, and the minimum learning rate during training to 0.1 times the initial learning rate; the training strategy is frozen-backbone (freezing) training. The training loss comprises a confidence loss, a classification loss and a prediction box regression loss, where the confidence and classification losses are cross-entropy losses and the regression loss is the GIOU loss.
Specifically, the confidence loss is a cross-entropy loss:

$$L_{conf}=-\sum_{i=0}^{S^2}\sum_{j=0}^{M}I_{ij}^{obj}\left[\hat{C}_i\log C_i+(1-\hat{C}_i)\log(1-C_i)\right]-\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{M}I_{ij}^{noobj}\left[\hat{C}_i\log C_i+(1-\hat{C}_i)\log(1-C_i)\right]$$

where $S$ is the size of the current feature map; $M$ is the number of anchor boxes at each feature point; $\hat{C}_i$ is the ground-truth confidence, equal to 1 when the prediction box contains an object and 0 otherwise; $C_i$ is the predicted confidence; and $\lambda_{noobj}$ weights the bounding boxes of negative samples, with a default value of 0.5.
The classification loss is likewise a cross-entropy loss:

$$L_{cls}=-\sum_{i=0}^{S^2}\sum_{j=0}^{M}I_{ij}^{obj}\sum_{c\in classes}\left[\hat{P}_i(c)\log P_i(c)+(1-\hat{P}_i(c))\log(1-P_i(c))\right]$$

where $I_{ij}^{obj}$ equals 1 when the bounding box generated by the $j$-th anchor box at feature point $i$ is responsible for detecting an object, in which case the classification loss of that bounding box is computed, and 0 otherwise; $P_i(c)$ is the predicted probability for category $c$; and $\hat{P}_i(c)$ is the corresponding ground-truth probability.
The regression box loss is the GIOU loss:

$$GIOU=IOU-\frac{|C\setminus(A\cup B)|}{|C|},\qquad L_{GIOU}=1-GIOU$$

where $A$ is the prediction box, $B$ is the ground-truth box, and $C$ is the smallest enclosing rectangle that can simultaneously contain $A$ and $B$.
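The GIOU computation can be sketched for axis-aligned boxes as follows; the (x1, y1, x2, y2) box format is an assumption for illustration:

```python
def giou(box_a, box_b):
    """GIOU between two boxes given as (x1, y1, x2, y2):
    GIOU = IOU - |C minus (A union B)| / |C|, where C is the smallest
    rectangle enclosing both boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    # smallest rectangle C enclosing both boxes
    c_area = ((max(ax2, bx2) - min(ax1, bx1)) *
              (max(ay2, by2) - min(ay1, by1)))
    iou = inter / union
    return iou - (c_area - union) / c_area

def giou_loss(box_a, box_b):
    """Regression loss of step S4.2: L_GIOU = 1 - GIOU."""
    return 1.0 - giou(box_a, box_b)
```

Unlike plain IOU, GIOU remains informative for disjoint boxes: two non-overlapping boxes get a negative GIOU, so the loss still provides a gradient toward overlap.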
Step S4.3: training the model of step S3 with the training data enhanced in step S2.2; during training the model is evaluated on the validation set images, and the trained model is obtained when the validation accuracy and loss no longer change.
Further, in step S5 of the present application, the test set pictures are input into the trained optimized YOLOV5 model. The model first resizes each picture to a uniform size: if the picture is smaller than the set size, gray bars are padded around it; if it is larger, the picture is compressed. The model outputs detection results at three scales, (20×20), (40×40) and (80×80), and fuses them, yielding [(20×20)+(40×40)+(80×80)]×3 = 25200 prediction candidate boxes. Assuming there are 10 road target categories to be detected, the trained road target detection model represents the output as a two-dimensional tensor of shape (25200, 15), where 15 comprises the 10 class scores, the 4 location parameters (x, y, w, h) of the detection box, and 1 confidence score. Detection boxes whose predicted score falls below the set threshold are filtered out by the optimized non-maximum suppression algorithm, and the retained boxes are the final detection result.
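The candidate-box arithmetic above can be checked directly; the 10-class setting is the example assumed in the text:

```python
def output_shape(num_classes=10, scales=(20, 40, 80), anchors_per_cell=3):
    """Shape of the fused YOLOV5 prediction tensor: one row per candidate
    box, each row holding (x, y, w, h), one confidence score and the
    per-class scores."""
    boxes = sum(s * s for s in scales) * anchors_per_cell
    return boxes, 4 + 1 + num_classes

shape = output_shape()  # (25200, 15) for the three scales with 10 classes
```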
The beneficial effects are that:
(1) The data enhancement method effectively enriches the background diversity of the training samples and improves the generalization capability of the model;
(2) Optimizing the backbone network of YOLOV5 with deformable convolution improves its ability to extract features of deformed or truncated objects in shielding environments, thereby improving the detection accuracy of shielded targets;
(3) The receptive field enhancement module RFEM, built from dilated convolutions, allows the model to better learn the relation between a shielded target and its surrounding environment, improving the feature expression capability for shielded targets;
(4) Combining the EIOU evaluation method with the Soft-NMS (flexible non-maximum suppression) algorithm to optimize the prediction post-processing avoids prediction boxes being erroneously suppressed under dense shielding conditions, reducing the miss rate of shielded targets.
Drawings
In order to illustrate the objectives, technical solutions and advantages of the application more clearly, the present application will be described in detail below with reference to the accompanying drawings.
FIG. 1 is a flow chart of the method of the present application;
FIG. 2 is a Mixup data enhancement schematic;
FIG. 3 is a schematic diagram of the deformable convolution DCNv2;
FIG. 4 is a schematic diagram of a backbone network modified with a deformable convolution DCNv 2;
FIG. 5 is a schematic view of the structure of the receptive field enhancement module RFEM of the application;
fig. 6 is a schematic diagram of the overall structure of the improved YOLOV5 of the present application.
Detailed Description
Hereinafter, preferred embodiments of the present application will be described in detail with reference to the accompanying drawings. The examples are only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
As shown in fig. 1, a road shielding target detection method based on the improved YOLOV5 comprises the following sequential steps:
S1, constructing a road target data set and dividing it into a training set, a verification set and a test set. The data set pictures may be acquired through a driving recorder or a vehicle-mounted camera; the acquired images are then labeled with the LabelImg software in the Pascal VOC format, i.e. the label files take xml as suffix;
s2, processing the training set by a Mixup data enhancement method, and enriching the number of training samples of the road shielding target;
s3, constructing an improved YOLOV5 shielding target detection model;
s4, inputting the training set enhanced in the step S2 into the model in the step S3 for training, and updating model parameters by using back propagation;
and S5, inputting the image of the test set into the improved YOLOV5 model trained in the step S4 for detection, and outputting detection results, namely the boundary box position information of the target in the image and the category information of the target.
In the present application, the step S1 specifically includes:
preprocessing the acquired driving scene data set: first, the data set pictures are labeled with LabelImg software; common automobiles such as cars, trucks and vans are labeled 'Car'; pedestrians are labeled 'Pedestrian'; persons riding bicycles or motorcycles are labeled 'Cyclist'. The label files are stored in Pascal VOC format, i.e. with the 'xml' suffix. The labeled pictures and label files are placed according to the VOC directory layout, and the data set is divided into a training set, a verification set and a test set at a ratio of 8:1:1.
In the present application, the data enhancement of step S2 is shown in fig. 2. The Mixup data enhancement method can be described by the following two formulas:

$$\tilde{x}=\lambda x_i+(1-\lambda)x_j,\qquad \tilde{y}=\lambda y_i+(1-\lambda)y_j$$

where $x_i$ and $x_j$ are the images before enhancement; $y_i$ and $y_j$ are the corresponding labels; $\tilde{x}$ and $\tilde{y}$ are the enhanced image and label; and $\lambda$ is a fraction between 0 and 1.
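A minimal sketch of the Mixup blend above, applied to flat pixel lists; real training code would operate on image tensors and merge the bounding box labels of both source images:

```python
def mixup(img_a, img_b, lam=0.5):
    """Blend two equally sized images pixel-wise:
    x_tilde = lam * x_i + (1 - lam) * x_j  (the Mixup rule of step S2)."""
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same size")
    return [lam * a + (1 - lam) * b for a, b in zip(img_a, img_b)]

# For detection, the mixed image keeps the bounding box annotations of
# BOTH source images, each weighted by its mixing coefficient.
blended = mixup([0.0, 0.4, 1.0], [1.0, 0.6, 0.0], lam=0.7)
```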
In the application, in the step S3, an improved YOLOV5 shielding target detection model is constructed specifically comprising the following parts:
Fig. 3 is a schematic diagram of the deformable convolution. In a real shielding environment, the usable features of an object are mostly truncated or deformed. Because the sampling positions of its convolution kernel can be adjusted, the deformable convolution DCNv2 enhances the model's ability to recognize objects with large geometric deformation. DCNv2 extracts object features as follows:

$$y(p)=\sum_{k=1}^{K}w_k\cdot x(p+p_k+\Delta p_k)\cdot\Delta m_k$$

where $x(p)$ and $y(p)$ are the features at position $p$ in the input feature map $x$ and the output feature map $y$, respectively; $K$ is the number of sampling positions of the convolution kernel; $w_k$ and $p_k$ are the weight and the pre-specified offset of the $k$-th sampling position; and $\Delta p_k$ and $\Delta m_k$ are the learnable offset and the modulation scalar of the $k$-th sampling position.
Fig. 4 is a schematic diagram of the backbone network optimized with the deformable convolution DCNv2. The application introduces DCNv2 into the backbone network of YOLOV5; since a 1×1 deformable convolution cannot learn offset parameters, 3×3 deformable convolutions are used throughout to optimize the backbone. The optimized backbone is obtained by stacking a Focus module, CBS layers, CSPDLayer layers and an SPP layer, where CSPDLayer is the CSPLayer optimized with the deformable convolution DCNv2.
Fig. 5 shows the receptive field enhancement module RFEM of the application. The RFEM module consists of 3 parts: the first part is a multi-scale dilated convolution; the second part is the ECA attention mechanism; the third part is an ordinary 1×1 convolution. The dilation rates of the multi-scale dilated convolutions are 1, 2, 3 and 4, and the corresponding receptive fields are computed as:
RF=d×(k-1)+1
where RF is the receptive field size; d is the dilation rate of the dilated convolution; and k is the size of the convolution kernel. The number of output channels of each dilated convolution is set to 1/4 of that of the input feature map. At the output stage of the first part, the RFEM module fuses the input feature map with the feature maps output by the multi-scale dilated convolutions; assuming the input feature map is F, the output of the first part is:

$$F_1=\mathrm{Concat}\left([F,\,DC_{d=1}(F),\,DC_{d=2}(F),\,DC_{d=3}(F),\,DC_{d=4}(F)]\right)$$

where $F_1$ is the output feature map; Concat denotes concatenation along the channel dimension; and $DC(\cdot)$ denotes a dilated convolution.
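The receptive-field formula above is easy to check numerically; a 3×3 kernel is assumed, matching the RFEM design:

```python
def receptive_field(dilation, kernel_size=3):
    """RF = d * (k - 1) + 1 for a single dilated convolution."""
    return dilation * (kernel_size - 1) + 1

# The four RFEM branches (d = 1, 2, 3, 4) with 3x3 kernels:
fields = [receptive_field(d) for d in (1, 2, 3, 4)]  # [3, 5, 7, 9]
```

Stacking the four branches thus covers context from 3×3 up to 9×9 around each position, which is how the module widens the effective receptive field without extra parameters per branch.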
The second part of the RFEM is the ECA attention mechanism, which is used to enhance channel correlation and suppress useless channel information. The input feature map is first processed by global average pooling; channel-dimension information is then extracted with an adaptive one-dimensional convolution; the extracted channel information is normalized with a Sigmoid activation function to obtain a channel attention weight vector; finally, the input feature map is weighted by this attention vector and passed as input to the next layer. The ECA processing is described as follows:
$$F_2=\mathrm{Sigmoid}\left(\mathrm{AConv1D}_k\left(\mathrm{GAP}(F)\right)\right)\otimes F$$

where $F_2$ is the feature map output by ECA; $\mathrm{AConv1D}_k$ is an adaptive one-dimensional convolution with kernel size $k$; and GAP is global average pooling. The kernel size $k$ of AConv1D is determined by:

$$k=\left|\frac{\log_2 C}{\gamma}+\frac{b}{\gamma}\right|_{odd}$$

where $C$ is the number of channels of the input feature map; $|\cdot|_{odd}$ means $k$ can only take an odd value; and $\gamma$ and $b$ are constants with defaults 2 and 1, respectively.
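A sketch of the adaptive kernel-size rule; rounding an even intermediate result up to the next odd number is an assumption borrowed from common ECA implementations:

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """k = |log2(C)/gamma + b/gamma|_odd from the ECA attention design;
    an even result is bumped to the next odd number."""
    t = int(abs(math.log2(channels) + b) / gamma)
    return t if t % 2 == 1 else t + 1

# e.g. a 256-channel feature map: (8 + 1) / 2 = 4.5 -> t = 4 -> k = 5
k = eca_kernel_size(256)
```

The kernel size therefore grows only logarithmically with the channel count, which is what keeps ECA nearly parameter-free.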
The third part of the RFEM uses an ordinary 1×1 convolution to adjust the number of channels. The first two parts of the RFEM double the channel count of the input feature map; to keep the number of channels unchanged, the third part uses a 1×1 ordinary convolution to restore the feature map to its input channel count, so that the RFEM can be integrated into YOLOV5.
Fig. 6 shows the overall structure of the road shielding target detection model based on the improved YOLOV5. In the post-processing stage of the model's prediction output, the application suppresses redundant bounding boxes with an EIOU evaluation method combined with a flexible non-maximum suppression algorithm, together called the Soft-EIOU-NMS algorithm. Soft-EIOU-NMS distinguishes redundant prediction boxes with the EIOU metric while updating their confidence scores with the Soft-NMS algorithm rather than hard suppression. The confidence score is updated by:
$$s_i=s_i\cdot f\left(\mathrm{EIOU}(M,b_i)\right)$$

where $s_i$ is the confidence score; $M$ is the reference prediction box, i.e. the prediction box with the current maximum confidence score; and $b_i$ are the remaining prediction boxes. $f(\cdot)$ is a Gaussian function:

$$f(x)=e^{-x^2/\sigma}$$

where $\sigma$ is a constant. The EIOU is calculated as:

$$\mathrm{EIOU}=\mathrm{IOU}-\frac{\rho^2(M,b_i)}{c^2}-\frac{\rho^2(w,w_i)}{c_w^2}-\frac{\rho^2(h,h_i)}{c_h^2}$$

where $\rho^2(\cdot)$ is the squared Euclidean distance between center points; $c$ is the diagonal length of the minimum enclosing rectangle of $M$ and $b_i$; $w$ and $h$ are the width and height of $M$; $w_i$ and $h_i$ are the width and height of $b_i$; and $c_w$ and $c_h$ are the width and height of the minimum enclosing rectangle. IOU is the area intersection-over-union of the reference box and each remaining prediction box:

$$\mathrm{IOU}=\frac{|A\cap B|}{|A\cup B|}$$
the specific process of the Soft-EIOU-NMS algorithm of the application is shown in the following table:
In the application, in step S4, the training set enhanced in step S2 is input into the model of step S3 for training, and the model parameters are updated by back propagation. Specifically:
S4.1: configuring the training environment of the model;
S4.2: setting the model training parameters, specifically:
The optimizer for model training is set to SGD, the initial learning rate to 0.001, and the minimum learning rate during training to 0.1 times the initial learning rate; the training strategy is frozen-backbone (freezing) training. The training loss comprises a confidence loss, a classification loss and a prediction box regression loss, where the confidence and classification losses are cross-entropy losses and the regression loss is the GIOU loss.
Specifically, the confidence loss is a cross-entropy loss:

$$L_{conf}=-\sum_{i=0}^{S^2}\sum_{j=0}^{M}I_{ij}^{obj}\left[\hat{C}_i\log C_i+(1-\hat{C}_i)\log(1-C_i)\right]-\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{M}I_{ij}^{noobj}\left[\hat{C}_i\log C_i+(1-\hat{C}_i)\log(1-C_i)\right]$$

where $S$ is the size of the current feature map; $M$ is the number of anchor boxes at each feature point; $\hat{C}_i$ is the ground-truth confidence, equal to 1 when the prediction box contains an object and 0 otherwise; $C_i$ is the predicted confidence; and $\lambda_{noobj}$ weights the bounding boxes of negative samples, with a default value of 0.5.
The classification loss is likewise a cross-entropy loss:

$$L_{cls}=-\sum_{i=0}^{S^2}\sum_{j=0}^{M}I_{ij}^{obj}\sum_{c\in classes}\left[\hat{P}_i(c)\log P_i(c)+(1-\hat{P}_i(c))\log(1-P_i(c))\right]$$

where $I_{ij}^{obj}$ equals 1 when the bounding box generated by the $j$-th anchor box at feature point $i$ is responsible for detecting an object, in which case the classification loss of that bounding box is computed, and 0 otherwise; $P_i(c)$ is the predicted probability for category $c$; and $\hat{P}_i(c)$ is the corresponding ground-truth probability.
The regression box loss is the GIOU loss:

$$GIOU=IOU-\frac{|C\setminus(A\cup B)|}{|C|},\qquad L_{GIOU}=1-GIOU$$

where $A$ is the prediction box, $B$ is the ground-truth box, and $C$ is the smallest enclosing rectangle that can simultaneously contain $A$ and $B$.
Step S4.3: training the model of step S3 with the training data enhanced in step S2; during training the model is evaluated on the validation set images, and the trained model is obtained when the validation accuracy and loss no longer change.
In the application, in step S5, the test set images are input into the improved YOLOV5 model trained in step S4 for detection; specifically:
The test set pictures are input into the trained optimized YOLOV5 model. The model first resizes each picture to a uniform size: if the picture is smaller than the set size, gray bars are padded around it; if it is larger, the picture is compressed. The model outputs detection results at three scales, (20×20), (40×40) and (80×80), and fuses them, yielding [(20×20)+(40×40)+(80×80)]×3 = 25200 prediction candidate boxes. Assuming there are 10 road target categories to be detected, the trained road target detection model represents the output as a two-dimensional tensor of shape (25200, 15), where 15 comprises the 10 class scores, the 4 location parameters (x, y, w, h) of the detection box, and 1 confidence score. Redundant detection boxes are then suppressed by the Soft-EIOU-NMS algorithm, and the retained boxes are the final detection result.
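The resizing step can be sketched as follows; aspect-preserving letterboxing (as in common YOLOV5 implementations) is an assumption here, since the text only states that small pictures are padded with gray bars and large ones compressed:

```python
def letterbox_size(w, h, target=640):
    """Scaled size and gray-bar padding needed to fit a (w, h) picture into
    a square target x target model input. Aspect-preserving letterboxing is
    assumed: small pictures are scaled up and padded with gray bars, large
    ones are scaled down ("compressed")."""
    scale = min(target / w, target / h)
    new_w, new_h = int(w * scale), int(h * scale)
    return (new_w, new_h), ((target - new_w) // 2, (target - new_h) // 2)

# A 1280x720 dashcam frame becomes 640x360 with 140-pixel gray bars
# above and below.
size, pad = letterbox_size(1280, 720)
```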
The foregoing is merely a specific implementation of the application. It will be appreciated by those skilled in the art that various changes, modifications, substitutions and alterations may be made to these embodiments without departing from the principles and spirit of the application, the scope of which is defined by the appended claims and their equivalents.
Claims (6)
1. A road shielding target detection method based on improved YOLOV5, characterized by comprising the following steps:
s1, constructing a road target data set, and dividing the data set into a training set, a verification set and a test set;
s2, processing a road target training set by a Mixup data enhancement method, and enriching the number of road shielding target training samples;
s3, constructing an improved YOLOV5 shielding target detection model;
s4, inputting the training data in the step S2 into the step S3 model for training;
s5, inputting the images of the test set into the model trained in the step S4 for detection, and outputting a detection result.
2. A road shielding target detection method based on improved YOLOV5 according to claim 1, characterized in that the step S1 specifically comprises:
s1.1, collecting road target pictures, for example from a driving recorder or a mobile camera, to build a data set for model training;
and S1.2, labeling the pictures from step S1.1 with LabelImg software in the Pascal VOC format, i.e. label files with the .xml suffix, and dividing the data into a training set, a validation set and a test set at a ratio of 8:1:1.
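A minimal sketch of such an 8:1:1 split (the filenames and the fixed seed are hypothetical, for reproducibility):

```python
import random

def split_dataset(filenames, ratios=(8, 1, 1), seed=0):
    """Shuffle annotated image filenames and split them into
    training / validation / test lists at the given ratio."""
    names = list(filenames)
    random.Random(seed).shuffle(names)  # deterministic shuffle
    total = sum(ratios)
    n_train = len(names) * ratios[0] // total
    n_val = len(names) * ratios[1] // total
    return (names[:n_train],
            names[n_train:n_train + n_val],
            names[n_train + n_val:])
```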
3. A road shielding target detection method based on improved YOLOV5 according to claim 1, characterized in that: in the step S2, the road target training set is processed by the Mixup data enhancement method to enrich the number of road shielding target training samples, specifically comprising the following contents:
step S2.1, converting the training data set into RGB-format images and simultaneously resizing the images to the uniform size required by the model, such as 640×640;
and step S2.2, performing data enhancement on the training data from step S2.1 by the Mixup method. Let x_i and x_j denote two images before enhancement, and y_i and y_j their respective labels; the Mixup-enhanced image x̃ and label ỹ are described by the following two formulas:

x̃ = λx_i + (1 − λ)x_j
ỹ = λy_i + (1 − λ)y_j

where λ ∈ [0, 1] is the mixing coefficient.
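The Mixup operation can be sketched as below; sampling λ from a Beta(α, α) distribution is the standard Mixup choice and is assumed here, since the patent does not state how λ is drawn:

```python
import numpy as np

def mixup(x_i, x_j, y_i, y_j, alpha=1.0, rng=None):
    """Mixup augmentation: convex combination of two images and their
    labels, x~ = lam*x_i + (1-lam)*x_j, with lam ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    x_mix = lam * x_i + (1 - lam) * x_j
    y_mix = lam * y_i + (1 - lam) * y_j
    return x_mix, y_mix
```

Because the result is a convex combination, pixel values stay within the range of the two inputs and one-hot label vectors still sum to 1.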
4. A road shielding target detection method based on improved YOLOV5 according to claim 1, characterized in that: in the step S3, an improved YOLOV5 shielding target detection model is constructed, comprising the following steps:
step S3.1, optimizing the backbone network of YOLOV5, specifically by using the deformable convolution DCNv2, i.e. a 3×3 deformable convolution, to replace the ordinary convolution in the CSPLayer block of YOLOV5. The optimized backbone network is obtained by stacking a Focus module, CBS layers, CSPDLayer layers and an SPP layer, wherein CSPDLayer is the CSPLayer optimized with the deformable convolution DCNv2.
And S3.2, adding a receptive field enhancement module RFEM (Receptive Field Enhancement Module), designed using dilated convolution, between the three effective feature layers of the optimized YOLOV5 to construct an optimized path aggregation network PANet. Specifically, the RFEM comprises three parts: the first part is a multi-scale dilated convolution; the second part is the ECA attention mechanism; the third part is an ordinary 1×1 convolution. In the present application, the dilation rates constituting the multi-scale dilated convolution are 1, 2, 3 and 4, respectively. Assuming the input feature map is denoted F, the feature maps of the first part are fused with a Concat operation, and DC denotes the dilated convolution, the output feature map F1 of the first part of the RFEM is given by the following formula:
F1 = Concat([F, DC_d=1(F), DC_d=2(F), DC_d=3(F), DC_d=4(F)])
The second part first applies global average pooling GAP to the input feature map, then extracts channel-dimension information with the adaptive one-dimensional convolution AConv1D, normalizes the extracted channel information with the Sigmoid activation function to obtain a channel attention weight vector, and finally weights the input feature map with this attention vector to form the input of the next layer. The output feature map F2 of the second part is described as:

F2 = F1 ⊗ Sigmoid(AConv1D(GAP(F1)))

where ⊗ denotes channel-wise multiplication.
The third part uses an ordinary convolution of size 1×1 to adjust the number of channels. In the present application, the first two parts of the RFEM expand the input feature map to twice its original channel count, so the third part uses the 1×1 ordinary convolution to restore the feature map to the channel count it had before entering the module.
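A minimal PyTorch sketch of an RFEM of this shape follows; the per-branch channel split (c/4 per dilated branch, so the concatenation reaches 2c) and the ECA kernel size are our assumptions:

```python
import torch
import torch.nn as nn

class RFEM(nn.Module):
    """Receptive-field enhancement sketch: (1) multi-scale dilated 3x3
    convolutions with rates 1..4 concatenated with the input, (2) ECA-style
    channel attention (GAP -> 1-D conv -> Sigmoid gate), (3) a 1x1
    convolution restoring the original channel count."""
    def __init__(self, c, k_eca=3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c // 4, 3, padding=d, dilation=d)
            for d in (1, 2, 3, 4))
        self.eca = nn.Conv1d(1, 1, k_eca, padding=k_eca // 2, bias=False)
        self.proj = nn.Conv2d(2 * c, c, 1)

    def forward(self, f):
        # Part 1: concat input with the four dilated branches -> 2c channels
        f1 = torch.cat([f] + [b(f) for b in self.branches], dim=1)
        # Part 2: ECA attention: GAP, 1-D conv over channels, Sigmoid gate
        w = f1.mean(dim=(2, 3))                                 # (N, 2c)
        w = torch.sigmoid(self.eca(w.unsqueeze(1))).squeeze(1)  # (N, 2c)
        f2 = f1 * w[:, :, None, None]
        # Part 3: 1x1 conv restores the input channel count
        return self.proj(f2)
```

Note that padding equal to the dilation rate keeps every branch at the input's spatial size, so the outputs can be concatenated directly.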
And step S3.3, optimizing the prediction post-processing stage of YOLOV5 by using an EIOU-based overlap criterion together with the soft non-maximum suppression algorithm.
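The patent names soft-EIOU-NMS without defining it; one plausible sketch, combining Gaussian soft-NMS score decay with the EIoU overlap measure (penalty weights and the decay form are our assumptions), is:

```python
import numpy as np

def eiou(a, b):
    """EIoU between two boxes (x1, y1, x2, y2): IoU minus normalized
    centre-distance and width/height difference penalties."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter)
    cw = max(a[2], b[2]) - min(a[0], b[0])   # enclosing-box width
    ch = max(a[3], b[3]) - min(a[1], b[1])   # enclosing-box height
    rho2 = ((a[0] + a[2] - b[0] - b[2]) ** 2
            + (a[1] + a[3] - b[1] - b[3]) ** 2) / 4
    dw2 = ((a[2] - a[0]) - (b[2] - b[0])) ** 2
    dh2 = ((a[3] - a[1]) - (b[3] - b[1])) ** 2
    return iou - rho2 / (cw ** 2 + ch ** 2) - dw2 / cw ** 2 - dh2 / ch ** 2

def soft_eiou_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Soft-NMS: instead of deleting boxes overlapping the current best
    box, decay their scores by a Gaussian of the (positive) EIoU."""
    boxes, scores = list(boxes), list(scores)
    keep = []
    while boxes:
        i = int(np.argmax(scores))
        best = boxes.pop(i)
        keep.append((best, scores.pop(i)))
        scores = [s * np.exp(-max(0.0, eiou(best, b)) ** 2 / sigma)
                  for b, s in zip(boxes, scores)]
        kept = [(b, s) for b, s in zip(boxes, scores) if s > score_thresh]
        boxes = [b for b, _ in kept]
        scores = [s for _, s in kept]
    return keep
```

A heavily overlapping duplicate is not discarded outright; its score is merely decayed, which is what preserves partially occluded targets that hard NMS would suppress.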
5. A road shielding target detection method based on improved YOLOV5 according to claim 1, characterized in that: in the step S4, the training data from step S2 are input into the step S3 model for training, specifically comprising the following steps:
s4.1, configuring a training environment of a model;
step S4.2, setting model training parameters, which specifically comprises the following steps:
The optimizer for model training is set to SGD, the initial learning rate to 0.001, and the minimum learning rate during training to 0.1 times the initial learning rate; the training strategy is set to freeze training. The training loss comprises a confidence loss function, a classification loss function and a prediction box regression loss function, wherein the confidence and classification losses are cross-entropy loss functions and the regression box loss is the GIOU loss function.
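As a hedged illustration of these training parameters in PyTorch: the cosine decay schedule, the momentum/weight-decay values, and the "backbone" attribute name used for freeze training are our assumptions, not stated in the patent:

```python
import torch

def build_optimizer_and_scheduler(model, epochs=100, lr0=1e-3):
    """SGD with initial lr 0.001, decayed down to 0.1 * lr0
    (a cosine schedule is one common way to reach that floor)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr0,
                                momentum=0.937, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=0.1 * lr0)
    return optimizer, scheduler

def freeze_backbone(model, frozen_prefix="backbone"):
    """Freeze-training phase: disable gradients for backbone parameters
    (the attribute name 'backbone' is hypothetical)."""
    for name, p in model.named_parameters():
        p.requires_grad = not name.startswith(frozen_prefix)
```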
6. A road shielding target detection method based on improved YOLOV5 according to claim 1, characterized in that: in the step S5, the test set images are input into the model trained in step S4 for detection and the detection result is output, specifically comprising the following contents:
The test set pictures are input into the trained optimized YOLOV5 model. First, the model adjusts each picture to be detected to a uniform size: if the picture is smaller than the set size, gray bars are added around it; if it is larger, the picture is compressed. The model outputs detection results at three scales, (20×20), (40×40) and (80×80), and finally fuses them, yielding [(20×20)+(40×40)+(80×80)]×3 = 25200 prediction candidate boxes. Assuming there are 10 categories of road targets to be detected, the trained road target detection model represents the output as a two-dimensional tensor of shape (25200, 15), where 15 comprises the 10 category scores, the 4 location parameters (x, y, w, h) of the detection box, and 1 confidence parameter. Detection boxes whose predicted scores are below a set threshold are then filtered out by the optimized non-maximum suppression algorithm, and the retained boxes form the final detection result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310423047.8A CN116630932A (en) | 2023-04-20 | 2023-04-20 | Road shielding target detection method based on improved YOLOV5 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116630932A true CN116630932A (en) | 2023-08-22 |
Family
ID=87596348
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310423047.8A Pending CN116630932A (en) | 2023-04-20 | 2023-04-20 | Road shielding target detection method based on improved YOLOV5 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116630932A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117132943A (en) * | 2023-10-20 | 2023-11-28 | 南京信息工程大学 | Method, device and system for detecting wearing of safety helmet and storage medium |
CN117611998A (en) * | 2023-11-22 | 2024-02-27 | 盐城工学院 | Optical remote sensing image target detection method based on improved YOLOv7 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||