Disclosure of Invention
The invention aims to provide a target identification method based on an improved lightweight YOLOv3 framework, which solves the problems in the prior art: it improves the robustness of the model parameters to noise while increasing the target detection speed and the accuracy of small-target detection.
To achieve this purpose, the invention provides the following scheme: a target identification method based on an improved lightweight YOLOv3 framework, comprising the following steps:
S1, collecting pictures of vehicles, pedestrians and traffic environments under different road conditions, driving environments and weather conditions, and making an initial sample data set; specifically, step S1 includes:
S11, starting a driving recorder or a high-definition camera installed on the vehicle, and shooting driving video of the road traffic environment in real time;
S12, performing frame-splitting processing on the obtained driving video and extracting each frame as an image to obtain driving image sequence sets under different driving environments;
S13, screening the driving image sequence sets obtained in step S12 and selecting driving images under different illumination conditions, traffic periods and environmental backgrounds;
S14, marking the selected driving images with a labeling tool: framing the target areas, which comprise vehicles, pedestrians and traffic signs, and then attaching labels to them to make the initial sample data set.
S2, preprocessing and enhancing the picture data in the initial sample data set to obtain a target identification sample data set; specifically, step S2 includes: processing the characteristic parameters of the targets to be recognized in the initial sample data set obtained in step S1 through translation, rotation, saturation adjustment, exposure adjustment and noise-adding operations to obtain a complete sample data set.
S3, dividing the obtained target recognition sample data set into a training set and a test set.
S4, embedding an SENet structure in the YOLOv3-tiny framework to obtain a YOLOv3-tiny-SE network model; specifically, step S4 comprises the following steps:
Embedding the SENet structure in the YOLOv3-tiny framework: the SENet structure is inserted after each pooling layer and after the convolutional layers preceding the final output. By modifying the YOLOv3-tiny.cfg file, the SENet structure is added after the pooling layers at layers 2, 4, 6, 8, 10 and 12 and after the convolutional layers at layers 13, 14, 15, 19 and 22, and the feature-channel values input to the global pooling layer of each SENet structure are specified as 16, 32, 64, 128, 256, 512, 1024, 256, 512 and 256, i.e. the numbers of feature channels output by the layers at which the structure is embedded, yielding the YOLOv3-tiny-SE network model.
S5, training the YOLOv3-tiny-SE network model on the training set; specifically, step S5 comprises:
S51, after the sample data set has been enhanced and its parameters marked in step S2, recalculating the anchor box values for the prepared complete sample data set; the anchor values in the traffic environment are calculated with the K-means clustering method as follows: reading the marked data set, randomly selecting width and height values from the labeled pictures as coordinate points and initial clustering centers, and iterating the K-means computation to obtain the specific anchor values;
S52, setting the hyper-parameters and network parameters for training, inputting the training set into the YOLOv3-tiny-SE network model for multi-task training, and saving the trained network model weight file.
S6, testing the performance of YOLOv3-tiny-SE on the test set; specifically, step S6 comprises the following steps:
S61, loading the trained network model weight file obtained in step S52, inputting the test set into the trained YOLOv3-tiny-SE network model, and obtaining multi-scale feature maps through the convolutional layers, pooling layers, SENet structures and upsampling layer;
S62, activating the x and y coordinates, confidence and category probability predicted by the network with a logistic function, and obtaining the coordinates, confidence and category probability of all prediction boxes through threshold judgment;
S63, removing redundant detection boxes from the result of step S62 through non-maximum suppression to generate the final target detection boxes and recognition results.
S7, comparing the performance test result of YOLOv3-tiny-SE on the test set obtained in step S6 with the performance of YOLOv3-tiny to obtain a performance comparison result.
The invention discloses the following technical effects. To address the problems in the prior art that target detection in complex environments is slow and small-target detection is not accurate enough, the lightweight version YOLOv3-tiny of YOLOv3 is combined with the SENet structure to obtain the YOLOv3-tiny-SE network model, which is then used for target detection and identification. The invention also provides an improved activation function, the PSReLU function, which is used to activate the model. With this target identification method, real-time road-condition video images acquired by camera tools such as a driving recorder can be processed quickly and accurately in real time, providing a scientific basis for the decision control of vehicle behavior in automatic driving.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1-4, the invention provides a target identification method based on an improved lightweight YOLOv3 framework, which specifically comprises the following steps:
s1, collecting pictures of vehicles, pedestrians and traffic environments under different roads, traffic environments and weather conditions, and making an initial sample data set; the method comprises the following steps:
s11, starting a driving recorder or a high-definition camera installed by a vehicle, and shooting driving videos in a road traffic environment in real time;
s12, performing framing processing on the obtained driving video, and extracting images of each frame to obtain driving image sequence sets in different driving environments;
s13, screening the driving image sequence set, and selecting the driving images under different illumination conditions, traffic periods and environmental backgrounds;
and S14, marking the selected driving image by using a marking tool, marking the target to be identified in the sample data set by using a Labelimg sample marking tool for parameters, framing a target area (specifically comprising three types of vehicles, pedestrians and traffic signs), and marking a label to manufacture an initial sample data set.
S2, preprocessing and enhancing the data of the initial sample data set thus made, perfecting it to obtain the target identification sample data set, specifically as follows:
The initial sample data set is processed by a written program that performs the following operations: on the basis of the existing initial sample data set, the characteristic parameters of the targets to be recognized are processed through translation, rotation, saturation and exposure adjustment and noise-adding operations, adding sample data; the resulting target recognition sample data set completes the initial sample data set and improves the generalization ability of the neural network, as sketched below.
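As a minimal illustration of these enhancement operations (not the embodiment's actual program), the following Python sketch applies translation, rotation, saturation and exposure adjustment, and additive noise to a single image with OpenCV and NumPy; the shift amounts, rotation angle, scaling factors and noise level are assumed illustrative values.

```python
import cv2
import numpy as np

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Return translated, rotated, color-adjusted and noisy copies of one image."""
    h, w = image.shape[:2]
    out = []

    # Translation: shift by 10% of the width and height (illustrative values).
    m = np.float32([[1, 0, 0.1 * w], [0, 1, 0.1 * h]])
    out.append(cv2.warpAffine(image, m, (w, h)))

    # Rotation: 5 degrees about the image center (illustrative value).
    r = cv2.getRotationMatrix2D((w / 2, h / 2), 5, 1.0)
    out.append(cv2.warpAffine(image, r, (w, h)))

    # Saturation and exposure adjustment in HSV space.
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] = np.clip(hsv[..., 1] * 1.3, 0, 255)  # saturation up 30%
    hsv[..., 2] = np.clip(hsv[..., 2] * 0.8, 0, 255)  # exposure down 20%
    out.append(cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR))

    # Additive Gaussian noise.
    noise = np.random.normal(0, 10, image.shape)
    out.append(np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8))

    return out
```

Note that the geometric operations (translation and rotation) must also be applied to the annotated bounding boxes so that the labels remain consistent with the transformed images.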
S3, dividing the obtained target recognition sample data set into a training set and a test set at a ratio of 7:3, 8:2 or 8:1, for example as follows.
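A minimal sketch of the split, assuming each sample is an (image path, label path) pair; the 7:3 default ratio is one of the proportions named above.

```python
import random

def split_dataset(samples: list, train_ratio: float = 0.7, seed: int = 0):
    """Shuffle samples and split them into a training set and a test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```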
S4, embedding the SENet structure in the YOLOv3-tiny framework to obtain the YOLOv3-tiny-SE network model, comprising the following steps:
S41, improving the lightweight YOLOv3 framework by the steps shown in Table 1, embedding the SENet structure into the YOLOv3-tiny framework to obtain the improved YOLOv3-tiny-SE network model shown in fig. 3;
S42, YOLOv3-tiny is a lightweight framework of YOLOv3 whose overall network architecture is shown in Table 1; it specifically comprises 13 convolutional layers, 6 pooling layers, 2 fusion layers, 1 upsampling layer and 2 output layers of different scales. Compared with YOLOv3, the overall architecture removes the residual layers, replacing them with a series of pooling layers, and omits some of the convolutional layers and the FPN network used for feature extraction, thereby simplifying the network, reducing computational complexity and improving recognition speed;
S43, the processing idea of YOLOv3-tiny for target detection and recognition is the same as that of YOLOv3: YOLOv3 performs a Batch Normalization (BN) operation after the convolution of each convolutional layer to avoid overfitting during network training, and then applies a Leaky-ReLU function as the activation function after batch normalization;
S44, YOLOv3 adds an FPN structure on the basis of the previous two generations of the method to improve the recognition accuracy for targets of multiple scales, as follows:
First, an image pyramid is built for the image, its levels are input into the corresponding networks, and target detection is performed separately on feature maps of different depths. The feature map of a deeper layer is upsampled and exploited by the feature map of the current layer, so that the current feature map also obtains the deeper layer's information; low-order and high-order semantic information are thus organically fused, the detection precision is improved, and the deficiencies of the two former versions of the method are remedied. Introducing the FPN (Feature Pyramid Network) into the YOLOv3 framework improves the precision of small-target identification and makes the recognition of traffic signs more effective; the sketch below illustrates the fusion step;
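The upsample-and-fuse idea described in S44 can be sketched in PyTorch as below for two adjacent scales; the module name, channel counts and nearest-neighbor interpolation are illustrative assumptions rather than the exact YOLOv3 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseUp(nn.Module):
    """Fuse a deeper, semantically stronger feature map into a shallower one."""

    def __init__(self, deep_ch: int, shallow_ch: int):
        super().__init__()
        # 1x1 convolution to match the shallow map's channel count.
        self.reduce = nn.Conv2d(deep_ch, shallow_ch, kernel_size=1)

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        # Upsample the deep map to the shallow map's resolution, then concatenate,
        # so the current feature map also carries the deeper layer's information.
        up = F.interpolate(self.reduce(deep), scale_factor=2, mode="nearest")
        return torch.cat([shallow, up], dim=1)

# Example: fuse a 13x13x512 deep map into a 26x26x256 shallow map.
fused = FuseUp(512, 256)(torch.randn(1, 512, 13, 13), torch.randn(1, 256, 26, 26))
print(fused.shape)  # torch.Size([1, 512, 26, 26])
```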
S45, in the YOLOv3-tiny-SE network model shown in fig. 3, the SENet structure first performs global average pooling on the input feature map to obtain a feature map of size c×1×1 (c is the number of feature channels), then passes it through two fully connected layers that first reduce and then restore the dimension, and finally applies a Sigmoid function for nonlinear processing to obtain weights of size c×1×1; these weights are multiplied with the original input feature map at corresponding positions to obtain the final output;
S46, embedding the SENet structure in the YOLOv3-tiny framework, specifically as follows:
The SENet structure is embedded after each pooling layer and after the convolutional layers before the final output: by modifying the YOLOv3-tiny.cfg file, the SENet structure is added after the pooling layers at layers 2, 4, 6, 8, 10 and 12 and after the convolutional layers at layers 13, 14, 15, 19 and 22, and the feature-channel values input to the global pooling layer of each SENet structure are specified as 16, 32, 64, 128, 256, 512, 1024, 256, 512 and 256, i.e. the numbers of feature channels output by the layers at which the structure is embedded, yielding the YOLOv3-tiny-SE network model;
S47, the network depth of YOLOv3-tiny is originally 24 layers and becomes 35 layers after the SENet structures are embedded; the main purpose of embedding the SENet network is to strengthen useful information and compress useless information. Taking the second pooling layer as a concrete example of an embedded SENet structure: the feature map output by this pooling layer is 208×208×16, which is also the input feature map size of the global pooling layer (Global Pooling); a 1×1×16 feature map is obtained after global average pooling, a 1×1×1 feature map is then obtained after dimensionality reduction through the first fully connected layer (Fully Connected), a 1×1×16 feature map is obtained after dimensionality increase through the second fully connected layer, and 1×1×16 weight values are finally obtained after activation by the Sigmoid function; multiplying these weights with the 208×208×16 input feature map gives the final output, as in the sketch below.
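The SE computation that S45 and S47 walk through (global average pooling, two fully connected layers that first reduce and then restore the channel dimension, Sigmoid weighting, channel-wise multiplication) can be sketched in PyTorch as follows; the reduction ratio of 16 is an assumption, chosen here so that the 16-channel example above reduces to the single-neuron 1×1×1 intermediate map described.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight feature channels by learned importance."""

    def __init__(self, channels: int, reduction: int = 16):  # ratio is an assumption
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # global average pooling -> c x 1 x 1
        self.excite = nn.Sequential(
            nn.Linear(channels, max(channels // reduction, 1)),  # reduce dimension
            nn.ReLU(inplace=True),
            nn.Linear(max(channels // reduction, 1), channels),  # restore dimension
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # multiply the weights into the input feature map

# The 208x208x16 output of the second pooling layer from the example in S47:
y = SEBlock(16)(torch.randn(1, 16, 208, 208))
print(y.shape)  # torch.Size([1, 16, 208, 208])
```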
TABLE 1
S5, training the YOLOv3-tiny-SE network model on the training set, specifically comprising the following steps:
S51, clustering the real target boxes of the targets to be recognized marked in the training set, obtaining initial candidate boxes for the targets predicted in the training set with the area intersection-over-union (IOU) as the evaluation index, and inputting the initial candidate boxes as initial parameters into the YOLOv3-tiny-SE network model, specifically as follows:
The real target boxes of the training data set are clustered with the K-means method according to the distance formula d(box, centroid) = 1 − IOU(box, centroid), where IOU(box, centroid) is the area intersection-over-union of the predicted target box and the real target box; with IOU(box, centroid) as the evaluation standard, a predicted candidate box whose value is not less than 0.5 is used as an initial target box;
The area intersection-over-union IOU(box, centroid) is given by the following formula:

$$\mathrm{IOU}(box, centroid) = \frac{box_{pred} \cap box_{truth}}{box_{pred} \cup box_{truth}}$$

where $box_{pred}$ and $box_{truth}$ denote the areas of the predicted target box and the real target box respectively; the ratio of their intersection to their union is the intersection-over-union of the real target box and the predicted initial candidate box. A clustering sketch follows;
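A minimal NumPy sketch of the anchor clustering with d(box, centroid) = 1 − IOU(box, centroid); since only box widths and heights are clustered, boxes are compared as if they shared a common corner, and the choice of k = 6 (the anchor count YOLOv3-tiny uses) and the termination criterion are assumptions.

```python
import numpy as np

def iou_wh(boxes: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """IOU between (n, 2) width/height boxes and (k, 2) centroids, corners aligned."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None] +
             (centroids[:, 0] * centroids[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes: np.ndarray, k: int = 6, iters: int = 100) -> np.ndarray:
    """Cluster labeled box sizes into k anchors using the distance d = 1 - IOU."""
    rng = np.random.default_rng(0)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Assign each box to the centroid with the smallest 1 - IOU distance.
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```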
S53, calling the initial weights of the YOLOv3-tiny network and setting the hyper-parameters: the learning rate, the number of iteration steps N and the batch_size (the hyper-parameters can be adjusted according to the obtained model data); then inputting the training data set into the YOLOv3-tiny-SE network model for training, stopping when the loss value it outputs is smaller than a certain threshold Q1 or the preset maximum number of iterations N is reached, to obtain a well-trained YOLOv3-tiny-SE network model; specifically as follows:
Calling the initial network weights of YOLOv3-tiny, inputting the training data set into the YOLOv3-tiny network for training, outputting the loss function value, and continuously training and adjusting the network weights and bias values according to that value until the loss function value output on the training set is smaller than the threshold Q1 or the maximum number of iterations N is reached, at which point training stops and the trained YOLOv3-tiny-SE network model is obtained;
The loss function loss(object) is expressed by the following formula:

$$
\begin{aligned}
loss(object) ={} & \lambda_{coord}\sum_{i=0}^{K^{2}}\sum_{j=0}^{M} I_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
& + \lambda_{coord}\sum_{i=0}^{K^{2}}\sum_{j=0}^{M} I_{ij}^{obj}\left[(w_i-\hat{w}_i)^2+(h_i-\hat{h}_i)^2\right] \\
& - \sum_{i=0}^{K^{2}}\sum_{j=0}^{M} I_{ij}^{obj}\left[\hat{C}_i\log C_i+(1-\hat{C}_i)\log(1-C_i)\right] \\
& - \lambda_{noobj}\sum_{i=0}^{K^{2}}\sum_{j=0}^{M} I_{ij}^{noobj}\left[\hat{C}_i\log C_i+(1-\hat{C}_i)\log(1-C_i)\right] \\
& - \sum_{i=0}^{K^{2}} I_{i}^{obj}\sum_{c\in classes}\left[\hat{p}_i(c)\log p_i(c)+(1-\hat{p}_i(c))\log(1-p_i(c))\right]
\end{aligned}
$$

The terms of the loss function correspond, in order, to the loss of the predicted center coordinates, the loss of the predicted bounding box, the loss of the predicted confidence and the loss of the predicted category. The losses of the predicted center coordinates and bounding boxes are expressed as sums of squared errors, and the losses of the predicted categories and confidences are expressed as cross-entropy loss functions.

In the above formula, $\lambda_{coord}$ is the error coefficient of the predicted coordinates; $\lambda_{noobj}$ is the confidence error coefficient when no object is contained; $K^{2}$ is the number of grid cells into which the input image is divided; $M$ is the number of target boxes predicted by each grid cell; $x_i, y_i, w_i, h_i$ are the predicted horizontal and vertical coordinates of the target center point and its width and height, and $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$ the corresponding values of the real target; $I_{ij}^{obj}$ indicates that the $j$th candidate box in the $i$th grid cell is responsible for detecting the object, and $I_{ij}^{noobj}$ that it is not; $C_i$ and $\hat{C}_i$ are the predicted and real confidences of the target to be detected in the $i$th grid cell; $p_i(c)$ and $\hat{p}_i(c)$ are the predicted and real probability values that the target identified in the $i$th grid cell belongs to a certain category $c$.
The activation function of YOLOv3 after the convolutional layers is the Leaky-ReLU function, whose expression is:

$$f(x)=\begin{cases}x, & x>0\\ ax, & x\le 0\end{cases}$$

The Leaky-ReLU function evolved from the ReLU function. The values the ReLU function yields for x ≤ 0 are all 0, so during training the problem may arise that the affected neuron weights cannot be updated. This has little influence on a deep neural network, but a large influence on a neural network with few layers. Leaky-ReLU therefore changes ReLU's zero output in the negative domain into a linear function with a small slope, preserving the output of the negative domain; however, the parameter a is a fixed value that must be determined by artificial prior knowledge and repeated training, and noise robustness in the inactive state cannot be guaranteed. To address these problems, this embodiment proposes an improved activation function, the PSReLU (Parametric Softplus-ReLU) function shown in fig. 2, whose expression is:
in the positive value domain, the YOLOv3-tiny adopts an activation function, leak-ReLU, which is the same as a ReLU function, in the negative value domain, a Softplus function is adopted, log2 units are shifted downwards, and a parameter α is taken as a learnable parameter in the network, and back propagation training is carried out in the network to be jointly optimized with other network layers.
S6, testing the performance of the YOLOv3-tiny-SE network model on the test set; specifically, step S6 comprises the following steps:
S61, loading the trained network weights, inputting the test set into the trained network, and obtaining multi-scale feature maps through the convolutional layers, pooling layers, SENet structures and upsampling layer;
S62, activating the x and y coordinates, confidence and category probability predicted by the network with a logistic function, and obtaining the coordinates, confidence and category probability of all prediction boxes through threshold judgment;
S63, removing redundant detection boxes from the result through non-maximum suppression (NMS) to generate the final target detection boxes and recognition results, as in the sketch below;
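A minimal NumPy sketch of the non-maximum suppression in S63; the 0.45 IOU threshold is an assumed typical value, not one specified by the embodiment.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.45) -> list[int]:
    """Keep the highest-scoring boxes, dropping overlaps above iou_thresh.

    boxes: (n, 4) as [x1, y1, x2, y2]; scores: (n,) confidences."""
    order = scores.argsort()[::-1]  # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the current best box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        # Discard boxes that overlap the kept box too much.
        order = order[1:][iou <= iou_thresh]
    return keep
```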
S64, comparing the effect of the native YOLOv3-tiny model with the original activation function against the improved YOLOv3-tiny-SE network model: performance tests are run on the native YOLOv3-tiny model with the improved activation function and with the original activation function respectively, and on the YOLOv3-tiny-SE network model with the improved activation function and with the original activation function respectively;
S65, inputting the test set obtained in step S3 into each of the networks of step S64 for performance detection, and obtaining the final evaluation indexes of model performance, including the mean Average Precision (mAP), the number of frames detected per second (FPS) and the Recall rate.
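Of the evaluation indexes named in S65, recall and FPS can be computed directly, as the sketch below shows (mAP is usually taken from a standard evaluation toolkit, since its interpolation details vary); the function names are illustrative.

```python
import time

def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN): the fraction of labeled targets that were detected."""
    return true_positives / (true_positives + false_negatives)

def measure_fps(detect, images) -> float:
    """Average detection frames per second of a detector callable over images."""
    start = time.perf_counter()
    for img in images:
        detect(img)
    return len(images) / (time.perf_counter() - start)
```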
S7, comparing the performance test result of the YOLOv3-tiny-SE network model on the test set obtained in step S6 with the performance of YOLOv3-tiny to obtain the performance comparison result.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Various modifications and improvements of the technical solution of the present invention may be made by those skilled in the art without departing from the spirit of the present invention, and the technical solution of the present invention is to be covered by the protection scope defined by the claims.