CN115731533B - Vehicle-mounted target detection method based on improved YOLOv5 - Google Patents


Publication number
CN115731533B
CN115731533B
Authority
CN
China
Prior art keywords
feature
layer
yolov5
improved
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211506283.8A
Other languages
Chinese (zh)
Other versions
CN115731533A (en)
Inventor
张青春
蒋方呈
高峰
王文聘
张洪源
文张源
张宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN202211506283.8A priority Critical patent/CN115731533B/en
Publication of CN115731533A publication Critical patent/CN115731533A/en
Application granted granted Critical
Publication of CN115731533B publication Critical patent/CN115731533B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The vehicle-mounted target detection method based on improved YOLOv5 improves the YOLOv5 network structure to realize obstacle detection on complex roads. The specific operation steps are as follows: Step 1: collect images in front of the vehicle through a camera. Step 2: extract key frames from the video streams acquired by the camera to obtain a picture data set for subsequent model training; preprocess the collected picture data set and divide it into a training set, a test set and a verification set in a suitable proportion. Step 3: configure the related environment, build the improved YOLOv5 network structure, and feed the processed picture training set, picture test set and picture verification set into the improved YOLOv5 for training; after training is completed, the best.pt model with the best detection effect is obtained. Step 4: feed the image to be detected into the best.pt model to obtain the detection result. The invention maintains high recognition accuracy for small targets and low-resolution inputs, improving the accuracy of target detection.

Description

Vehicle-mounted target detection method based on improved YOLOv5
Technical Field
The invention relates to the technical field of computer image processing, in particular to a vehicle-mounted target detection method based on improved YOLOv5.
Background
With the rapid development of the logistics industry and the rapid growth in travel demand, China's road transportation industry is developing quickly. As the transportation industry grows, road conditions become more complex and traffic accidents more frequent. At present, vehicle-mounted obstacle recognition systems mostly rely on lidar, ultrasonic sensors and similar equipment, which are costly, computationally demanding and inconvenient to deploy and use.
In the field of target detection, the current mainstream approach is to use a deep-learning neural network and give it target-recognition capability through training. Mainstream target detection networks come in two structures: one-stage networks represented by YOLO, and two-stage networks represented by Fast-RCNN. A two-stage network first extracts regions of interest from the input image to locate targets, then extracts features from each region of interest, and finally uses a classifier to identify the category of each region; its detection accuracy is higher but its detection speed is slower. A one-stage network integrates localization and classification into a single network, which greatly improves detection speed at the cost of some detection precision.
The YOLO network series is developing quickly, with large improvements in detection speed and accuracy, and is no longer weaker than two-stage networks. YOLOv5, as the latest version of the YOLO series, clearly outperforms earlier versions, but it still falls short on low-resolution inputs and small-target detection, and its detection precision is easily disturbed under complex conditions.
Disclosure of Invention
In order to solve the problem that YOLOv5 falls short on low-resolution and small-target detection and that its detection precision is easily disturbed under complex conditions, the invention provides a vehicle-mounted target detection method based on improved YOLOv5, which uses an SPD-Conv structure to replace the original Conv structure and improves the precision of identifying low-resolution and small targets; the above technical problems can thus be effectively solved.
The invention is realized by the following technical scheme:
The vehicle-mounted target detection method based on improved YOLOv5 improves the YOLOv5 network structure to realize obstacle detection on complex roads; the specific operation steps are as follows:
step 1: collecting a front image of a vehicle through a camera;
step 2: extracting key frames from the video streams acquired by the camera to obtain a picture data set for subsequent model training; preprocessing the collected picture data set and dividing it into a training set, a test set and a verification set in a suitable proportion;
step 3: configuring the related environment, building the improved YOLOv5 network structure, and feeding the processed picture training set, picture test set and picture verification set into the improved YOLOv5 for training; after training is completed, the best.pt model with the best detection effect is obtained;
the improved YOLOv5 network structure is built by improving YOLOv5 in the following respects: replacing the original neck network of YOLOv5 with the weighted bidirectional pyramid network BiFPN for feature extraction; introducing an attention mechanism into the backbone network by adding a CBAM module, combining attention over the two dimensions of feature channel and feature space; replacing the original CNN module with the SPD-Conv module to obtain the YOLOv5-SPD module, which is used to handle low-resolution and small targets; replacing the original IoU function with the EIoU loss function; and replacing the SiLU activation function with the Mish activation function;
step 4: feeding the image to be detected into the best.pt model to obtain the detection result.
Further, in step 3 a feature fusion neck network is introduced into the YOLOv5 backbone network, and feature extraction is performed by the weighted bidirectional pyramid network BiFPN. The specific operation is as follows: a CBAM convolutional attention module, which combines a channel attention mechanism and a spatial attention mechanism, is introduced into the Backbone network of the YOLOv5 network. After the Backbone network extracts features, the channel attention mechanism of the CBAM module applies global average pooling and global max pooling to each input feature layer separately, converting it into two 1×1 vectors; the two pooled results are passed through a shared fully connected layer and added, a sigmoid operation on the sum yields a weight for each feature channel, and multiplying these weights by the original feature layer gives the channel-attended features.
The spatial attention mechanism of the CBAM module takes the maximum value and the average value of each feature point across the channels of the input feature layer and stacks them, converting the single feature layer into 2 channels; a convolution then adjusts the channel number back to 1, a sigmoid operation on the result yields a weight for each feature point, and multiplying these weights by the feature points of the original feature layer gives the spatially attended features.
The attention mechanism highlights the key parts of the features while attending to both the positional information and the semantic information of the target. Attention is introduced in both the low-level and the high-level feature layers of the Backbone network, i.e. CBAM modules are added at the 6th, 11th, 16th and last layers, highlighting low-level and high-level feature information; the CBAM module introduced at the last layer of the Backbone network serves the subsequent Neck bottleneck structure.
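A minimal NumPy sketch of the channel and spatial attention described above may make the data flow concrete. It is an illustration only, not the patent's implementation: the MLP weights w1 and w2 are hypothetical, and the spatial branch reduces the 2-channel stack with a 1×1 weighted sum as a stand-in for CBAM's usual 7×7 convolution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Channel branch: global avg + max pooling, shared 2-layer MLP, sigmoid."""
    avg = x.mean(axis=(1, 2))          # (C,) pooled vector
    mx = x.max(axis=(1, 2))            # (C,) pooled vector
    # shared MLP applied to both pooled vectors, results added, then sigmoid
    att = sigmoid(w2 @ np.maximum(w1 @ avg, 0) + w2 @ np.maximum(w1 @ mx, 0))
    return x * att[:, None, None]      # re-weight each channel

def spatial_attention(x):
    """Spatial branch: per-pixel max and mean over channels, stacked to 2
    channels, reduced back to 1 channel (1x1 stand-in for the 7x7 conv)."""
    mx = x.max(axis=0)                 # (H, W)
    avg = x.mean(axis=0)               # (H, W)
    stacked = np.stack([mx, avg])      # (2, H, W)
    w = np.array([0.5, 0.5])           # hypothetical 1x1 conv weights
    att = sigmoid(np.tensordot(w, stacked, axes=1))   # (H, W) weights
    return x * att[None, :, :]

def cbam(x, w1, w2):
    # channel attention first, spatial attention second, as described above
    return spatial_attention(channel_attention(x, w1, w2))
```

Here cbam() preserves the (C, H, W) shape of the input feature layer while re-weighting it along both dimensions.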
Furthermore, the CBAM module is introduced into the last layer of the Backbone network to serve the subsequent Neck bottleneck structure. The specific operation is as follows: the bidirectional pyramid network BiFPN introduces a learnable weight for features of different scales so as to better balance feature information across scales; that is, a learnable weight is introduced for each input feature to control the contribution of each feature layer, and the fused output O is computed as follows:
O = Σ_i ( w_i / (ε + Σ_j w_j) ) · I_i
where w_i ≥ 0 is the learnable weight after the SiLU activation function, and ε = 0.0001 prevents numerical instability;
the feature layers are then fused in this weighted manner, specifically:
P_i^td = Conv( (w_1·P_i^in + w_2·Resize(P_{i+1}^in)) / (w_1 + w_2 + ε) )
P_i^out = Conv( (w_1'·P_i^in + w_2'·P_i^td + w_3'·Resize(P_{i-1}^out)) / (w_1' + w_2' + w_3' + ε) )
where P_i^td is the intermediate feature of layer P_i, P_i^out is the output feature of layer P_i, and Resize converts the P_{i-1} and P_{i+1} feature layers to the same size as P_i.
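The weighted fusion above can be sketched as BiFPN's fast normalized fusion; this is a minimal NumPy illustration under stated assumptions: resize_nearest is a hypothetical stand-in for the Resize operation, the non-negativity of the weights is enforced here with a simple clamp rather than the activation function named in the text, and the convolution applied after fusion is omitted.

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """O = sum_i w_i * I_i / (eps + sum_j w_j), with w_i kept >= 0."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # enforce w_i >= 0
    norm = w / (eps + w.sum())                             # normalize weights
    out = np.zeros_like(features[0], dtype=float)
    for wi, f in zip(norm, features):
        out = out + wi * f
    return out

def resize_nearest(f, shape):
    """Nearest-neighbour resize so a neighbouring pyramid level matches P_i."""
    h, w = shape
    ri = np.arange(h) * f.shape[0] // h
    ci = np.arange(w) * f.shape[1] // w
    return f[np.ix_(ri, ci)]
```

With equal weights the fusion reduces to a near-exact average of the inputs, and the learnable weights shift that balance during training.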
Further, in step 3 the SPD-Conv module replaces the original CNN module to obtain the YOLOv5-SPD module. The YOLOv5-SPD module comprises an SPD layer and a non-strided convolution layer. The SPD layer downsamples the original feature map: it slices the feature map proportionally into a series of sub-feature maps and concatenates them along the channel dimension to obtain an intermediate feature map, specifically:
f_{scale-1, scale-1} = X[scale-1 : m : scale, scale-1 : n : scale];
where X is the original feature map of size m × n and scale is the scaling factor; in general each sub-feature map is f_{x,y} = X[x : m : scale, y : n : scale] with x, y ∈ {0, …, scale-1};
the non-strided convolution layer uses stride-1 convolutions to retain as much discriminative feature information as possible, while adjusting the depth and width of the intermediate feature map to meet the depth and width requirements of the subsequent network; using the YOLOv5-SPD module in place of the original CNN for low-resolution and small targets improves the precision with which they are identified.
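The slicing-and-concatenation step of the SPD layer can be sketched directly with NumPy strided slicing; this is an illustrative sketch of the space-to-depth operation only (the non-strided convolution that follows is omitted):

```python
import numpy as np

def spd_layer(x, scale=2):
    """Space-to-depth: slice a (C, H, W) map into scale*scale sub-maps
    f_xy = X[:, x::scale, y::scale] and concatenate them on the channel
    axis, giving (C*scale^2, H/scale, W/scale) with no information loss."""
    subs = [x[:, i::scale, j::scale] for i in range(scale) for j in range(scale)]
    return np.concatenate(subs, axis=0)
```

With scale = 2, a (C, H, W) map becomes (4C, H/2, W/2) and every original value survives, which is the lossless-downsampling property the SPD layer relies on.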
Further, in step 3 the EIoU loss function replaces the original IoU function; EIoU splits the aspect-ratio loss term into the separate differences between the predicted width and height and those of the minimum enclosing box. Meanwhile, Focal Loss is introduced to reduce the optimization contribution to BBox regression of the many anchor boxes that overlap little with the target box, so that the regression process focuses more on high-quality boxes. The specific formula is:
E_loss = IoU_loss + dis_loss + asp_loss = 1 - IoU + ρ²(b, b^gt)/c² + ρ²(w, w^gt)/c_w² + ρ²(h, h^gt)/c_h²
where dis_loss is the center-point loss, asp_loss is the width-height loss, ρ²(b, b^gt) is the squared Euclidean distance between the center points of the predicted and ground-truth boxes, ρ²(w, w^gt) and ρ²(h, h^gt) are the squared differences between their widths and heights respectively, c is the diagonal length of the minimum enclosing region containing both boxes, and c_w and c_h are the width and height of that region.
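The loss above can be sketched in plain Python for a single box pair in (x1, y1, x2, y2) form; this is an illustrative sketch of the EIoU terms only, and the Focal-Loss weighting mentioned above is not included:

```python
def eiou_loss(box_p, box_g):
    """EIoU loss: (1 - IoU) + center-distance term + width/height terms."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    # intersection-over-union
    ix = max(0.0, min(px2, gx2) - max(px1, gx1))
    iy = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = ix * iy
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter)
    # smallest enclosing box: diagonal c^2, width c_w, height c_h
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2
    # squared distance between box centers
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + \
           ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # squared width and height differences, normalized by the enclosing box
    wp, hp = px2 - px1, py2 - py1
    wg, hg = gx2 - gx1, gy2 - gy1
    asp = (wp - wg) ** 2 / cw ** 2 + (hp - hg) ** 2 / ch ** 2
    return (1 - iou) + rho2 / c2 + asp
```

For identical boxes the loss is zero, and it grows as the boxes drift apart in position or shape.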
Further, in step 3 the Mish activation function replaces the SiLU activation function. Mish has a lower bound and gives small weight to the negative half-axis, which prevents the neuron-necrosis phenomenon and produces a stronger regularization effect; retaining a small amount of negative information avoids ReLU's Dying-ReLU phenomenon and favors better expression and information flow. The specific formula of the Mish activation function is:
Mish(x) = x · Tanh(Softplus(x));
where Tanh is the hyperbolic tangent function and Softplus(x) = ln(1 + e^x) is an activation function that can be regarded as a smooth approximation of ReLU.
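The Mish formula can be sketched and checked in a few lines of plain Python; the log1p form of Softplus is a numerical-stability detail of this sketch, not something taken from the patent:

```python
import math

def softplus(x):
    # numerically stable ln(1 + e^x)
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def mish(x):
    """Mish(x) = x * tanh(softplus(x)); smooth, lower-bounded, and keeps
    a small amount of negative information on the negative half-axis."""
    return x * math.tanh(softplus(x))
```

Unlike ReLU, mish(-1.0) is a small negative value rather than exactly zero, which is the retained negative information the text refers to.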
Further, before the processed picture training set, picture test set and picture verification set are fed into the improved YOLOv5 for training in step 3, the network training parameters need to be set, specifically: the number of iterations is set to 200, the batch size to 16, and the initial learning rate to 0.0001.
Further, in step 1 the images in front of the vehicle are collected by a camera mounted on the top of the vehicle; while the vehicle is running, the camera collects a video stream of the scene in front of the vehicle.
Further, in step 2 key frames are extracted from the video streams collected by the camera: the current frame is extracted from the video stream at intervals of 1 s as a key frame and stored in the picture data set.
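The 1 s sampling rule can be sketched as a small helper that computes which frame indices to keep; in practice the frames themselves would be read with a video library such as OpenCV's VideoCapture, which is assumed here rather than shown:

```python
def key_frame_indices(fps, total_frames, interval_s=1.0):
    """Indices of the frames to keep when sampling one key frame per
    interval_s seconds from a stream with the given frame rate."""
    step = max(1, round(fps * interval_s))   # frames between key frames
    return list(range(0, total_frames, step))
```

For a 30 fps stream, one frame is kept every 30 frames.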
Further, the preprocessing of the collected picture data set in step 2 specifically includes: removing pictures that contain no target, have blurred features or have cluttered backgrounds; labeling the screened pictures by marking the targets to be detected, such as culverts, height-limit bars, trees and other obstacles, with rectangular boxes, recording the target names and the box coordinates, and saving them as txt files; finally, dividing the picture data set into a training set, a test set and a verification set in the ratio 7:2:1.
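The 7:2:1 division can be sketched as a simple shuffled split; the file names and the fixed seed below are hypothetical illustration choices:

```python
import random

def split_dataset(files, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle and split a list of image files into training, test and
    verification subsets in the 7:2:1 ratio described above."""
    files = list(files)
    random.Random(seed).shuffle(files)       # deterministic shuffle
    n = len(files)
    n_train = int(n * ratios[0])
    n_test = int(n * ratios[1])
    train = files[:n_train]
    test = files[n_train:n_train + n_test]
    val = files[n_train + n_test:]           # remainder goes to verification
    return train, test, val
```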
Advantageous effects
Compared with the prior art, the vehicle-mounted target detection method based on the improved YOLOv5 has the following beneficial effects:
(1) In this technical scheme the vehicle-mounted camera and the embedded device are mounted directly on the vehicle, so no additional hardware is needed and hardware cost is saved. The camera collects images in front of the vehicle, the images are fed into the model, and the model judges whether an obstacle is present ahead, thereby monitoring the road conditions around the vehicle. In addition, the original neck network of YOLOv5 is replaced by the weighted bidirectional pyramid network BiFPN for feature extraction; an attention mechanism is introduced into the backbone network by adding a CBAM module, combining attention over the two dimensions of feature channel and feature space; the SPD-Conv module replaces the original CNN module to obtain the YOLOv5-SPD module, which handles low-resolution and small targets and improves the accuracy of identifying them; and the EIoU loss function replaces the original IoU function, which calculates the difference between the predicted box and the ground-truth box more effectively and improves model precision.
(2) In this technical scheme an improved BiFPN bidirectional weighted pyramid network is introduced at the Neck end, learnable weights are introduced for features of different scales to better balance feature information across scales, and the SiLU activation function is replaced by the Mish activation function, avoiding ReLU's Dying-ReLU phenomenon and favoring better expression and information flow.
(3) In this technical scheme the EIoU loss function replaces the original IoU function, solving the sample-imbalance problem in the bounding-box regression task: the optimization contribution to BBox regression of the many anchor boxes that overlap little with the target box is reduced, so the regression process focuses more on high-quality boxes.
Drawings
FIG. 1 is a schematic overall flow chart of the present invention.
FIG. 2 is a diagram of an improved YOLOv5 network architecture in accordance with the present invention.
FIG. 3 is a schematic diagram of a CBAM attention module according to the present invention.
FIG. 4 is a diagram of a BiFPN network architecture in accordance with the present invention.
FIG. 5 is a schematic diagram of a Yolov5-SPD module according to the present invention.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings; the embodiments described are evidently only some, not all, of the embodiments of the invention.
Example 1:
As shown in FIGS. 1-5, the vehicle-mounted target detection method based on improved YOLOv5 improves the YOLOv5 network structure to realize obstacle detection on complex roads; the specific operation steps are as follows:
step 1: collecting a front image of a vehicle through a camera;
the camera is arranged at the top of the vehicle and used for collecting images in front of the vehicle; during the running process of the vehicle, the camera can collect video streams in front of the vehicle.
Step 2: extract key frames from the video streams collected by the camera, i.e. extract the current frame at intervals of 1 s as a key frame and store it in the picture data set, obtaining a picture data set for subsequent model training. Preprocess the collected picture data set: remove pictures that contain no target, have blurred features or have cluttered backgrounds; label the screened pictures by marking the targets to be detected, such as culverts, height-limit bars, trees and other obstacles, with rectangular boxes, recording the target names and the box coordinates, and saving them as txt files; finally, divide the picture data set into a training set, a test set and a verification set in the ratio 7:2:1.
Step 3: configure the related environment, build the improved YOLOv5 network structure, and feed the processed picture training set, picture test set and picture verification set into the improved YOLOv5 for training; after training is completed, the best.pt model with the best detection effect is obtained. The procedure specifically comprises the following steps:
the first step: the improved Yolov5 network structure is built, the Yolov5 is improved, and the improvement points of the Yolov5 are as follows: and replacing the original neck network of Yolov5 with a weighted bidirectional pyramid network BiFPN for feature extraction.
Introducing a CBAM convolution attention module into a Backbone network of a Yolov5 network, wherein the CBAM convolution attention module combines a channel attention mechanism and a space attention mechanism; the method comprises the steps of extracting features by a Backbone network of a backhaul, respectively carrying out global average pooling and global maximum pooling on a single feature layer in input feature layers by a CBAM module on a attention mechanism of a channel, converting the single feature layer into two 1x1 forms, adding the results of the global average pooling and the global maximum pooling by using a full connection layer, carrying out sigmoid operation on the added results to obtain a weight of each feature channel, and multiplying the weight by the original feature layer to obtain the features of the channel.
The attention mechanism of the CBAM module for the space is characterized in that the maximum value and the average value of each feature point on an input feature layer are taken, the maximum value and the average value are stacked, a single feature layer is converted into 2 channels, the number of the channels is adjusted by convolution with the number of the channels being 1 again, the single feature layer is converted into 1 channel again, sigmoid operation is carried out on the processed feature points, the weight of each feature point is obtained, and the feature of the feature point can be obtained by multiplying the weight with the feature point on the original feature layer.
The attention mechanism highlights the key part in the characteristics, simultaneously focuses on the position information and semantic information of the target, introduces the attention mechanism in both the bottom characteristic layer and the high-level characteristic layer of the Backbone network of the Backbone, namely adds a CBAM module in the 6 th layer, the 11 th layer, the 16 th layer and the last layer, highlights the bottom and high-level characteristic information, and introduces the CBAM module in the last layer of the Backbone network of the Backbone to meet the requirement of a subsequent Neck bottleneck structure.
The second step: introduce an attention mechanism into the backbone network by adding a CBAM module, combining attention over the two dimensions of feature channel and feature space; a CBAM module is introduced into the last layer of the Backbone network to serve the subsequent Neck bottleneck structure, and the bidirectional pyramid network BiFPN introduces a learnable weight for features of different scales so as to better balance feature information across scales; that is, a learnable weight is introduced for each input feature to control the contribution of each feature layer, and the fused output O is computed as follows:
O = Σ_i ( w_i / (ε + Σ_j w_j) ) · I_i
where w_i ≥ 0 is the learnable weight after the SiLU activation function, and ε = 0.0001 prevents numerical instability;
the feature layers are then fused in this weighted manner, specifically:
P_i^td = Conv( (w_1·P_i^in + w_2·Resize(P_{i+1}^in)) / (w_1 + w_2 + ε) )
P_i^out = Conv( (w_1'·P_i^in + w_2'·P_i^td + w_3'·Resize(P_{i-1}^out)) / (w_1' + w_2' + w_3' + ε) )
where P_i^td is the intermediate feature of layer P_i, P_i^out is the output feature of layer P_i, and Resize converts the P_{i-1} and P_{i+1} feature layers to the same size as P_i.
The third step: replace the original CNN module with the SPD-Conv module to obtain the YOLOv5-SPD module, which is used to handle low-resolution and small targets. The YOLOv5-SPD module comprises an SPD layer and a non-strided convolution layer. The SPD layer downsamples the original feature map: it slices the feature map proportionally into a series of sub-feature maps and concatenates them along the channel dimension to obtain an intermediate feature map, specifically:
f_{scale-1, scale-1} = X[scale-1 : m : scale, scale-1 : n : scale];
where X is the original feature map of size m × n and scale is the scaling factor; in general each sub-feature map is f_{x,y} = X[x : m : scale, y : n : scale] with x, y ∈ {0, …, scale-1};
the non-strided convolution layer uses stride-1 convolutions to retain as much discriminative feature information as possible, while adjusting the depth and width of the intermediate feature map to meet the depth and width requirements of the subsequent network; using the YOLOv5-SPD module in place of the original CNN for low-resolution and small targets improves the precision with which they are identified.
The fourth step: replace the original IoU function with the EIoU loss function; EIoU splits the aspect-ratio loss term into the separate differences between the predicted width and height and those of the minimum enclosing box. Meanwhile, Focal Loss is introduced to reduce the optimization contribution to BBox regression of the many anchor boxes that overlap little with the target box, so that the regression process focuses more on high-quality boxes. The specific formula is:
E_loss = IoU_loss + dis_loss + asp_loss = 1 - IoU + ρ²(b, b^gt)/c² + ρ²(w, w^gt)/c_w² + ρ²(h, h^gt)/c_h²
where dis_loss is the center-point loss, asp_loss is the width-height loss, ρ²(b, b^gt) is the squared Euclidean distance between the center points of the predicted and ground-truth boxes, ρ²(w, w^gt) and ρ²(h, h^gt) are the squared differences between their widths and heights respectively, c is the diagonal length of the minimum enclosing region containing both boxes, and c_w and c_h are the width and height of that region.
The fifth step: replace the SiLU activation function with the Mish activation function. Mish has a lower bound and gives small weight to the negative half-axis, which prevents the neuron-necrosis phenomenon and produces a stronger regularization effect; retaining a small amount of negative information avoids ReLU's Dying-ReLU phenomenon and favors better expression and information flow. The specific formula of the Mish activation function is:
Mish(x) = x · Tanh(Softplus(x));
where Tanh is the hyperbolic tangent function and Softplus(x) = ln(1 + e^x) is an activation function that can be regarded as a smooth approximation of ReLU.
The sixth step: before the processed picture training set, picture test set and picture verification set are fed into the improved YOLOv5 for training, the network training parameters need to be set, specifically: the number of iterations is set to 200, the batch size to 16, and the initial learning rate to 0.0001.
Step 4: feed the image to be detected into the best.pt model to obtain the detection result.

Claims (5)

1. A vehicle-mounted target detection method based on improved YOLOv5, wherein the YOLOv5 network structure is improved to realize obstacle detection on complex roads, the specific operation steps being as follows:
step 1: collecting a front image of a vehicle through a camera;
step 2: extracting key frames from the video streams acquired by the camera to obtain a picture data set for subsequent model training; preprocessing the collected picture data set and dividing it into a training set, a test set and a verification set in a suitable proportion;
step 3: configuring the related environment, building the improved YOLOv5 network structure, and feeding the processed picture training set, picture test set and picture verification set into the improved YOLOv5 for training; after training is completed, the best.pt model with the best detection effect is obtained;
the improved Yolov5 network structure is built, the Yolov5 is improved, and the improvement points of the Yolov5 are as follows:
firstly, replacing an original neck network of Yolov5 with a weighted bidirectional pyramid network BiFPN to extract features; introducing an attention mechanism in a backbone network, adding a CBAM module, and combining the attention mechanisms of two dimensions of a feature channel and a feature space; introducing a feature fusion neck network into a Yolov5 backbone network, and extracting features by a weighted bidirectional pyramid network BiFPN, wherein the specific operation mode is as follows:
introducing a CBAM convolution attention module into a Backbone network of a Yolov5 network, wherein the CBAM convolution attention module combines a channel attention mechanism and a space attention mechanism; the method comprises the steps that a Backbone network of a backhaul extracts features, a CBAM module carries out global average pooling and global maximum pooling on single feature layers in input feature layers respectively by a focus mechanism of a channel, converts the single feature layers into two 1x1 forms, adds the results of the global average pooling and the global maximum pooling by using a full connection layer, carries out sigmoid operation on the added results to obtain a weight of each feature channel, and multiplies the weight by an original feature layer to obtain features of the channel;
the attention mechanism of the CBAM module for the space is characterized in that the maximum value and the average value of each feature point on an input feature layer are taken, the maximum value and the average value are stacked, a single feature layer is converted into 2 channels, the number of the channels is adjusted by convolution with the number of the channels being 1 again, the single feature layer is converted into 1 channel again, sigmoid operation is carried out on the processed feature points, the weight of each feature point is obtained, and the feature of the feature point can be obtained by multiplying the weight with the feature point on the original feature layer;
the attention mechanism highlights the key part in the characteristics, simultaneously focuses on the position information and semantic information of the target, introduces the attention mechanism in both the bottom characteristic layer and the high-level characteristic layer of the Backbone network of the Backbone, namely adds a CBAM module in the 6 th layer, the 11 th layer, the 16 th layer and the last layer, highlights the bottom and high-level characteristic information, introduces the CBAM module in the last layer of the Backbone network of the Backbone to meet the requirement of the subsequent Neck bottleneck structure, and specifically operates in the following modes: the bi-directional pyramid network BiFPN introduces a learnable weight to the features of different scales so as to better balance the feature information of the different scales; namely, introducing a learnable weight parameter O to the features of different scales to control the weight of each layer of features, wherein the specific distribution mode of O is as follows:
O = Σ_i ( w_i / (ε + Σ_j w_j) ) · I_i

wherein I_i is the i-th input feature, w_i ≥ 0 is the corresponding learnable weight, kept non-negative by passing it through the SiLU activation function, and ε = 0.0001 prevents numerical instability;
the feature layers are fused in a weighted manner, specifically:

P_i^td = Conv( (w_1 · P_i^in + w_2 · Resize(P_(i+1)^in)) / (w_1 + w_2 + ε) )

P_i^out = Conv( (w_1' · P_i^in + w_2' · P_i^td + w_3' · Resize(P_(i-1)^out)) / (w_1' + w_2' + w_3' + ε) )

wherein P_i^td is the intermediate feature of the P_i layer, P_i^out is the output feature of the P_i layer, and Resize converts the P_(i-1) and P_(i+1) feature layers to the same size as P_i;
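The fast normalized fusion above can be sketched as follows. This is an illustrative NumPy version under two stated assumptions: the input feature maps have already been resized to a common shape, and a final clamp guards against the small negative range of SiLU (the original BiFPN paper uses ReLU for this role; the patent names SiLU).

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def fast_normalized_fusion(inputs, raw_weights, eps=1e-4):
    """BiFPN fast normalized fusion: each input feature map is weighted by
    w_i / (eps + sum_j w_j). The learnable raw weights are passed through
    SiLU and clamped at zero so every normalized weight is non-negative."""
    w = np.maximum(silu(np.asarray(raw_weights, dtype=float)), 0.0)
    norm = w / (eps + w.sum())                       # normalized weights
    fused = sum(wi * f for wi, f in zip(norm, inputs))
    return fused, norm

# fuse three same-sized maps (e.g. P_i, resized P_(i-1), resized P_(i+1))
rng = np.random.default_rng(2)
maps = [rng.standard_normal((8, 8)) for _ in range(3)]
fused, norm = fast_normalized_fusion(maps, raw_weights=[1.0, 2.0, 0.5])
```

The normalized weights sum to just under 1 (the ε term keeps the denominator strictly positive), so the fusion is a stable convex-like combination of the inputs.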
secondly, an SPD-Conv module replaces the original CNN module to obtain the YOLOv5-SPD module, wherein the YOLOv5-SPD module comprises an SPD layer and a non-strided convolution layer; the SPD layer downsamples the original feature map by slicing a feature map proportionally into a series of sub-feature maps and concatenating the sub-feature maps along the channel dimension to obtain an intermediate feature map, specifically:
f_(a,b) = X[a : m : scale, b : n : scale], a, b ∈ {0, 1, …, scale−1};

wherein X is the original feature map of size m × n and scale is the scaling factor; each of the scale² sub-feature maps is X downsampled by a factor of scale;
the non-strided convolution layer uses stride-1 convolution to retain as much discriminative feature information as possible, while controlling the depth and width of the intermediate feature map to match the depth and width required by the subsequent network; replacing the original CNN with the YOLOv5-SPD module improves the accuracy of recognizing low-resolution and smaller targets;
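The SPD slicing step is pure array indexing and can be shown directly; this sketch implements only the space-to-depth rearrangement (the subsequent stride-1 convolution is omitted), with illustrative tensor sizes.

```python
import numpy as np

def space_to_depth(x, scale=2):
    """SPD layer: slice a (C, H, W) feature map into scale*scale sub-maps
    f[a, b] = X[:, a::scale, b::scale] and concatenate them along the
    channel axis. Spatial resolution is traded for channel depth without
    discarding any pixels, unlike strided convolution or pooling."""
    c, h, w = x.shape
    assert h % scale == 0 and w % scale == 0
    subs = [x[:, a::scale, b::scale] for a in range(scale) for b in range(scale)]
    return np.concatenate(subs, axis=0)     # (C*scale^2, H/scale, W/scale)

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
y = space_to_depth(x, scale=2)              # (2, 4, 4) -> (8, 2, 2)
```

Every element of the input reappears exactly once in the output, which is why the technique is suited to small, low-resolution targets: no information is thrown away before the convolution sees it.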
then, the EIoU loss function replaces the original IoU function; compared with CIoU, the EIoU loss splits the aspect-ratio loss term into separate differences between the predicted width and height and the width and height of the minimum enclosing box; meanwhile, Focal Loss is introduced to reduce the contribution to BBox regression of the many anchor boxes that overlap little with the target box, so that the regression process focuses more on high-quality boxes, the specific formula being:
E_loss = IoU_loss + dis_loss + asp_loss = 1 − IoU + ρ²(b, b_gt)/c² + ρ²(w, w_gt)/c_w² + ρ²(h, h_gt)/c_h²

wherein dis_loss is the center-point loss and asp_loss is the width-height loss; ρ²(b, b_gt) denotes the squared Euclidean distance between the center points of the predicted and real boxes, ρ²(w, w_gt) and ρ²(h, h_gt) denote the squared differences between the widths and heights of the predicted and real boxes respectively, c denotes the diagonal length of the minimum enclosing region containing both the predicted and real boxes, c_w denotes the width of that region, and c_h denotes its height;
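The three EIoU terms can be computed directly from box coordinates. The sketch below is a plain-Python illustration for axis-aligned boxes in (x1, y1, x2, y2) form; the Focal Loss reweighting mentioned above is not included.

```python
def eiou_loss(box_p, box_g):
    """EIoU loss = (1 - IoU) + center-distance term + width term + height
    term, the latter three normalized by the minimum enclosing box."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    # intersection over union
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter)
    # minimum enclosing box: diagonal, width, height
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw * cw + ch * ch
    # squared center distance, squared width/height differences
    dx = (px1 + px2) / 2 - (gx1 + gx2) / 2
    dy = (py1 + py2) / 2 - (gy1 + gy2) / 2
    rho2_center = dx * dx + dy * dy
    rho2_w = ((px2 - px1) - (gx2 - gx1)) ** 2
    rho2_h = ((py2 - py1) - (gy2 - gy1)) ** 2
    return (1 - iou) + rho2_center / c2 + rho2_w / (cw * cw) + rho2_h / (ch * ch)

perfect = eiou_loss((0, 0, 4, 4), (0, 0, 4, 4))   # identical boxes
shifted = eiou_loss((1, 1, 5, 5), (0, 0, 4, 4))   # translated prediction
```

A perfectly matching box incurs zero loss, while a shifted box of the same size is penalized only through the IoU and center-distance terms, which is the separation of concerns EIoU is designed for.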
finally, the Mish activation function replaces the SiLU activation function; the Mish activation function is bounded below, assigns small weight to the negative half-axis, prevents neuron death, and produces a stronger regularization effect; it retains a small amount of negative information, avoiding the Dying ReLU phenomenon of ReLU and allowing better expressiveness and information flow; the specific formula of the Mish activation function is:
Mish(x)=x*Tanh(Softplus(x));
where Tanh is the hyperbolic tangent function and Softplus(x) = ln(1 + eˣ) is an activation function that can be regarded as a smooth approximation of ReLU;
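The formula above is short enough to verify numerically; the following pure-Python sketch evaluates Mish at a few points to show the properties claimed: zero at zero, small negative outputs for negative inputs (bounded below near −0.31), and near-identity behavior for large positive inputs.

```python
import math

def softplus(x):
    """Softplus(x) = ln(1 + e^x), a smooth approximation of ReLU."""
    return math.log1p(math.exp(x))

def mish(x):
    """Mish(x) = x * Tanh(Softplus(x)): smooth and non-monotonic, so small
    negative inputs keep a small negative output instead of being zeroed
    out as they would be under ReLU."""
    return x * math.tanh(softplus(x))

values = [mish(v) for v in (-2.0, 0.0, 2.0)]
```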
step 4: feeding the image to be detected into the best.pt model to obtain the detection result.
2. The improved YOLOv5-based vehicle-mounted object detection method of claim 1, wherein: before the processed picture training set, picture testing set and picture verification set are fed into the improved YOLOv5 for training, the network training parameters are set as follows: the number of iterations is set to 200, the batch size to 16, and the initial learning rate to 0.0001.
3. The improved YOLOv5-based vehicle-mounted object detection method of claim 1, wherein: the collection of the image in front of the vehicle by the camera in step 1 is performed as follows: the camera is mounted on the top of the vehicle to collect images in front of the vehicle; while the vehicle is running, the camera collects the video stream in front of the vehicle.
4. The improved YOLOv5-based vehicle-mounted object detection method of claim 1, wherein: in step 2, key frames are extracted from the video stream collected by the camera, i.e. the current frame is extracted from the video stream every 1 s as a key frame and saved into the picture data set.
5. The improved YOLOv5-based vehicle-mounted object detection method of claim 1, wherein: the preprocessing of the collected picture data set in step 2 specifically comprises: removing pictures that contain no target, have blurred features, or have cluttered backgrounds; labeling the screened pictures by marking the targets to be detected, such as culverts, height-limit bars, trees and other obstacles, with rectangular boxes, recording the target names and the coordinates of the rectangular boxes, and generating txt files for storage; and finally, dividing the picture data set into a training set, a testing set and a verification set in the ratio 7:2:1.
CN202211506283.8A 2022-11-29 2022-11-29 Vehicle-mounted target detection method based on improved YOLOv5 Active CN115731533B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211506283.8A CN115731533B (en) 2022-11-29 2022-11-29 Vehicle-mounted target detection method based on improved YOLOv5


Publications (2)

Publication Number Publication Date
CN115731533A CN115731533A (en) 2023-03-03
CN115731533B true CN115731533B (en) 2024-04-05

Family

ID=85299062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211506283.8A Active CN115731533B (en) 2022-11-29 2022-11-29 Vehicle-mounted target detection method based on improved YOLOv5

Country Status (1)

Country Link
CN (1) CN115731533B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116416504B (en) * 2023-03-16 2024-02-06 北京瑞拓电子技术发展有限公司 Expressway foreign matter detection system and method based on vehicle cooperation
CN116452972B (en) * 2023-03-17 2024-06-21 兰州交通大学 Transformer end-to-end remote sensing image vehicle target detection method
CN116342596B (en) * 2023-05-29 2023-11-28 云南电网有限责任公司 YOLOv5 improved substation equipment nut defect identification detection method
CN116994243B (en) * 2023-07-31 2024-04-02 安徽省农业科学院农业经济与信息研究所 Lightweight agricultural pest detection method and system
CN117058526A (en) * 2023-10-11 2023-11-14 创思(广州)电子科技有限公司 Automatic cargo identification method and system based on artificial intelligence
CN118190168B (en) * 2023-10-19 2024-10-18 重庆大学 Temperature monitoring method and system for key area of high-speed vehicle
CN117668669B * 2024-02-01 2024-04-19 齐鲁工业大学(山东省科学院) Pipeline safety monitoring method and system based on improved YOLOv (en)
CN117876848B * 2024-03-13 2024-05-07 成都理工大学 Complex environment falling stone detection method based on improved YOLOv5 (en)
CN118674723A (en) * 2024-08-23 2024-09-20 南京华视智能科技股份有限公司 Method for detecting virtual edges of coated ceramic area based on deep learning

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435425A (en) * 2021-08-26 2021-09-24 绵阳职业技术学院 Wild animal emergence and emergence detection method based on recursive multi-feature fusion
CN113989613A (en) * 2021-10-13 2022-01-28 上海海事大学 Light-weight high-precision ship target detection method coping with complex environment
CN114120019A (en) * 2021-11-08 2022-03-01 贵州大学 Lightweight target detection method
CN114548363A (en) * 2021-12-29 2022-05-27 淮阴工学院 Unmanned vehicle carried camera target detection method based on YOLOv5
CN114565959A (en) * 2022-02-18 2022-05-31 武汉东信同邦信息技术有限公司 Target detection method and device based on YOLO-SD-Tiny
CN114758288A (en) * 2022-03-15 2022-07-15 华北电力大学 Power distribution network engineering safety control detection method and device
CN114821032A (en) * 2022-03-11 2022-07-29 山东大学 Special target abnormal state detection and tracking method based on improved YOLOv5 network
CN114926722A (en) * 2022-04-19 2022-08-19 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Method and storage medium for detecting scale self-adaptive target based on YOLOv5
CN115311524A (en) * 2022-08-16 2022-11-08 盐城工学院 Small target detection algorithm fusing attention and multi-scale double pyramids
CN115331256A (en) * 2022-07-31 2022-11-11 南京邮电大学 People flow statistical method based on mutual supervision

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190339688A1 (en) * 2016-05-09 2019-11-07 Strong Force Iot Portfolio 2016, Llc Methods and systems for data collection, learning, and streaming of machine signals for analytics and maintenance using the industrial internet of things
CN109919251B (en) * 2019-03-21 2024-08-09 腾讯科技(深圳)有限公司 Image-based target detection method, model training method and device
CN112884064B (en) * 2021-03-12 2022-07-29 迪比(重庆)智能科技研究院有限公司 Target detection and identification method based on neural network
WO2022236824A1 (en) * 2021-05-14 2022-11-17 北京大学深圳研究生院 Target detection network construction optimization method, apparatus and device, and medium and product


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects; Raja Sunkara and Tie Luo; ECML PKDD 2022; pp. 5-6 *
Application of the improved YOLOv5 network in remote sensing image target detection; Zhou Huaping, Guo Wei; Remote Sensing Information; 2022-11-10; Vol. 37, No. 5; pp. 23-30 *
Remote sensing image target detection based on a dual attention mechanism; Zhou Xing, Chen Lifu; Computer and Modernization; 2020-08-15; No. 8; pp. 5-11 *

Also Published As

Publication number Publication date
CN115731533A (en) 2023-03-03

Similar Documents

Publication Publication Date Title
CN115731533B (en) Vehicle-mounted target detection method based on improved YOLOv5
CN112200161B (en) Face recognition detection method based on mixed attention mechanism
Dewi et al. Weight analysis for various prohibitory sign detection and recognition using deep learning
CN114202672A (en) Small target detection method based on attention mechanism
CN111126472A (en) Improved target detection method based on SSD
CN110929577A (en) Improved target identification method based on YOLOv3 lightweight framework
CN112232371B (en) American license plate recognition method based on YOLOv3 and text recognition
CN112528961B (en) Video analysis method based on Jetson Nano
CN113255589B (en) Target detection method and system based on multi-convolution fusion network
CN109670555B (en) Instance-level pedestrian detection and pedestrian re-recognition system based on deep learning
CN111126278A (en) Target detection model optimization and acceleration method for few-category scene
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S
CN112561801A (en) Target detection model training method based on SE-FPN, target detection method and device
Fan et al. A novel sonar target detection and classification algorithm
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN116665153A (en) Road scene segmentation method based on improved deep bv3+ network model
CN113903022A (en) Text detection method and system based on feature pyramid and attention fusion
CN112084897A (en) Rapid traffic large-scene vehicle target detection method of GS-SSD
CN116597411A (en) Method and system for identifying traffic sign by unmanned vehicle in extreme weather
CN111339950A (en) Remote sensing image target detection method
CN112785610B (en) Lane line semantic segmentation method integrating low-level features
CN117710841A (en) Small target detection method and device for aerial image of unmanned aerial vehicle
CN116311004B (en) Video moving target detection method based on sparse optical flow extraction
CN117237614A (en) Deep learning-based lake surface floater small target detection method
CN114782762B (en) Garbage image detection method and community garbage station

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant