Disclosure of Invention
The invention aims to provide a target identification method based on an improved lightweight YOLOv3 framework, which solves the problems in the prior art: it improves the robustness of the model parameters to noise while increasing the target detection speed and the accuracy of small-target detection.
To achieve this purpose, the invention provides the following scheme: a target identification method based on an improved lightweight YOLOv3 framework, comprising the following steps:
S1, collecting pictures of vehicles, pedestrians and traffic environments under different road conditions, driving environments and weather conditions, and making an initial sample data set; specifically, step S1 includes:
S11, starting a driving recorder or a high-definition camera installed on the vehicle, and shooting driving video of the road traffic environment in real time;
S12, performing frame-splitting processing on the obtained driving video and extracting each frame as an image to obtain driving image sequence sets under different driving environments;
S13, screening the driving image sequence sets obtained in step S12 and selecting driving images under different illumination conditions, traffic periods and environmental backgrounds;
S14, marking the selected driving images with a labeling tool: framing the target areas, which comprise vehicles, pedestrians and traffic signs, and then attaching labels to them to make the initial sample data set.
S2, preprocessing and enhancing the picture data in the initial sample data set to obtain a target identification sample data set; specifically, step S2 includes: processing the characteristic parameters of the targets to be recognized in the initial sample data set obtained in step S1 through translation, rotation, saturation adjustment, exposure adjustment and noise-adding operations to obtain a complete sample data set.
S3, dividing the obtained target recognition sample data set into a training set and a test set.
S4, embedding an SENet structure in the YOLOv3-tiny framework to obtain a YOLOv3-tiny-SE network model; specifically, step S4 comprises the following steps:
Embedding the SENet structure in the YOLOv3-tiny framework: the SENet structure is inserted after each pooling layer and after the convolutional layers preceding the final output. By modifying the YOLOv3-tiny.cfg file, the SENet structure is added after the pooling layers at layers 2, 4, 6, 8, 10 and 12 and after the convolutional layers at layers 13, 14, 15, 19 and 22, and the feature-channel values input to the global pooling layer of each SENet structure are specified as 16, 32, 64, 128, 256, 512, 1024, 256, 512 and 256, i.e. the numbers of feature channels output by the layers at which the structure is embedded, yielding the YOLOv3-tiny-SE network model.
S5, training the YOLOv3-tiny-SE network model on the training set; specifically, step S5 comprises:
S51, after the sample data set has been enhanced and its parameters marked in step S2, recalculating the anchor box values for the prepared complete sample data set; the anchor values in the traffic environment are calculated with the K-means clustering method as follows: reading the marked data set, randomly selecting width and height values from the labeled pictures as coordinate points and initial clustering centers, and iterating the K-means computation to obtain the specific anchor values;
S52, setting the hyper-parameters and network parameters for training, inputting the training set into the YOLOv3-tiny-SE network model for multi-task training, and saving the trained network model weight file.
S6, testing the performance of YOLOv3-tiny-SE on the test set; specifically, step S6 comprises the following steps:
S61, loading the trained network model weight file obtained in step S52, inputting the test set into the trained YOLOv3-tiny-SE network model, and obtaining multi-scale feature maps through the convolutional layers, pooling layers, SENet structures and upsampling layer;
S62, activating the x and y coordinates, confidence and category probability predicted by the network with a logistic function, and obtaining the coordinates, confidence and category probability of all prediction boxes through threshold judgment;
S63, removing redundant detection boxes from the result of step S62 through non-maximum suppression to generate the final target detection boxes and recognition results.
S7, comparing the performance test result of YOLOv3-tiny-SE on the test set obtained in step S6 with the performance of YOLOv3-tiny to obtain a performance comparison result.
The invention discloses the following technical effects. To address the problems in the prior art that target detection in complex environments is slow and small-target detection is not accurate enough, the lightweight version YOLOv3-tiny of YOLOv3 is combined with the SENet structure to obtain the YOLOv3-tiny-SE network model, which is then used for target detection and identification. The invention also provides an improved activation function, the PSReLU function, which is used to activate the model. With this target identification method, real-time road-condition video images acquired by camera tools such as a driving recorder can be processed quickly and accurately in real time, providing a scientific basis for the decision control of vehicle behavior in automatic driving.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1-4, the invention provides a target identification method based on an improved lightweight YOLOv3 framework, which specifically comprises the following steps:
s1, collecting pictures of vehicles, pedestrians and traffic environments under different roads, traffic environments and weather conditions, and making an initial sample data set; the method comprises the following steps:
s11, starting a driving recorder or a high-definition camera installed by a vehicle, and shooting driving videos in a road traffic environment in real time;
s12, performing framing processing on the obtained driving video, and extracting images of each frame to obtain driving image sequence sets in different driving environments;
s13, screening the driving image sequence set, and selecting the driving images under different illumination conditions, traffic periods and environmental backgrounds;
and S14, marking the selected driving image by using a marking tool, marking the target to be identified in the sample data set by using a Labelimg sample marking tool for parameters, framing a target area (specifically comprising three types of vehicles, pedestrians and traffic signs), and marking a label to manufacture an initial sample data set.
S2, preprocessing and enhancing the data of the initial sample data set thus made, perfecting it to obtain the target identification sample data set, specifically as follows:
The initial sample data set is processed by a written program that performs the following operations: on the basis of the existing initial sample data set, the characteristic parameters of the targets to be recognized are processed through translation, rotation, saturation and exposure adjustment and noise-adding operations, adding sample data; the resulting target recognition sample data set completes the initial sample data set and improves the generalization ability of the neural network, as sketched below.
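As a minimal illustration of these enhancement operations (not the embodiment's actual program), the following Python sketch applies translation, rotation, saturation and exposure adjustment, and additive noise to a single image with OpenCV and NumPy; the shift amounts, rotation angle, scaling factors and noise level are assumed illustrative values.

```python
import cv2
import numpy as np

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Return translated, rotated, color-adjusted and noisy copies of one image."""
    h, w = image.shape[:2]
    out = []

    # Translation: shift by 10% of the width and height (illustrative values).
    m = np.float32([[1, 0, 0.1 * w], [0, 1, 0.1 * h]])
    out.append(cv2.warpAffine(image, m, (w, h)))

    # Rotation: 5 degrees about the image center (illustrative value).
    r = cv2.getRotationMatrix2D((w / 2, h / 2), 5, 1.0)
    out.append(cv2.warpAffine(image, r, (w, h)))

    # Saturation and exposure adjustment in HSV space.
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] = np.clip(hsv[..., 1] * 1.3, 0, 255)  # saturation up 30%
    hsv[..., 2] = np.clip(hsv[..., 2] * 0.8, 0, 255)  # exposure down 20%
    out.append(cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR))

    # Additive Gaussian noise.
    noise = np.random.normal(0, 10, image.shape)
    out.append(np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8))

    return out
```

Note that the geometric operations (translation and rotation) must also be applied to the annotated bounding boxes so that the labels remain consistent with the transformed images.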
S3, dividing the obtained target recognition sample data set into a training set and a test set at a ratio of 7:3, 8:2 or 8:1, for example as follows.
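A minimal sketch of the split, assuming each sample is an (image path, label path) pair; the 7:3 default ratio is one of the proportions named above.

```python
import random

def split_dataset(samples: list, train_ratio: float = 0.7, seed: int = 0):
    """Shuffle samples and split them into a training set and a test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```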
S4, embedding the SENet structure in the YOLOv3-tiny framework to obtain the YOLOv3-tiny-SE network model, comprising the following steps:
S41, improving the lightweight YOLOv3 framework by the steps shown in Table 1, embedding the SENet structure into the YOLOv3-tiny framework to obtain the improved YOLOv3-tiny-SE network model shown in fig. 3;
S42, YOLOv3-tiny is a lightweight framework of YOLOv3 whose overall network architecture is shown in Table 1; it specifically comprises 13 convolutional layers, 6 pooling layers, 2 fusion layers, 1 upsampling layer and 2 output layers of different scales. Compared with YOLOv3, the overall architecture removes the residual layers, replacing them with a series of pooling layers, and omits some of the convolutional layers and the FPN network used for feature extraction, thereby simplifying the network, reducing computational complexity and improving recognition speed;
S43, the processing idea of YOLOv3-tiny for target detection and recognition is the same as that of YOLOv3: YOLOv3 performs a Batch Normalization (BN) operation after the convolution of each convolutional layer to avoid overfitting during network training, and then applies a Leaky-ReLU function as the activation function after batch normalization;
S44, YOLOv3 adds an FPN structure on the basis of the previous two generations of the method to improve the recognition accuracy for targets of multiple scales, as follows:
First, an image pyramid is built for the image, its levels are input into the corresponding networks, and target detection is performed separately on feature maps of different depths. The feature map of a deeper layer is upsampled and exploited by the feature map of the current layer, so that the current feature map also obtains the deeper layer's information; low-order and high-order semantic information are thus organically fused, the detection precision is improved, and the deficiencies of the two former versions of the method are remedied. Introducing the FPN (Feature Pyramid Network) into the YOLOv3 framework improves the precision of small-target identification and makes the recognition of traffic signs more effective; the sketch below illustrates the fusion step;
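The upsample-and-fuse idea described in S44 can be sketched in PyTorch as below for two adjacent scales; the module name, channel counts and nearest-neighbor interpolation are illustrative assumptions rather than the exact YOLOv3 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseUp(nn.Module):
    """Fuse a deeper, semantically stronger feature map into a shallower one."""

    def __init__(self, deep_ch: int, shallow_ch: int):
        super().__init__()
        # 1x1 convolution to match the shallow map's channel count.
        self.reduce = nn.Conv2d(deep_ch, shallow_ch, kernel_size=1)

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        # Upsample the deep map to the shallow map's resolution, then concatenate,
        # so the current feature map also carries the deeper layer's information.
        up = F.interpolate(self.reduce(deep), scale_factor=2, mode="nearest")
        return torch.cat([shallow, up], dim=1)

# Example: fuse a 13x13x512 deep map into a 26x26x256 shallow map.
fused = FuseUp(512, 256)(torch.randn(1, 512, 13, 13), torch.randn(1, 256, 26, 26))
print(fused.shape)  # torch.Size([1, 512, 26, 26])
```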
S45, in the YOLOv3-tiny-SE network model shown in fig. 3, the SENet structure first performs global average pooling on the input feature map to obtain a feature map of size c×1×1 (c is the number of feature channels), then passes it through two fully connected layers that first reduce and then restore the dimension, and finally applies a Sigmoid function for nonlinear processing to obtain weights of size c×1×1; these weights are multiplied with the original input feature map at corresponding positions to obtain the final output;
S46, embedding the SENet structure in the YOLOv3-tiny framework, specifically as follows:
The SENet structure is embedded after each pooling layer and after the convolutional layers before the final output: by modifying the YOLOv3-tiny.cfg file, the SENet structure is added after the pooling layers at layers 2, 4, 6, 8, 10 and 12 and after the convolutional layers at layers 13, 14, 15, 19 and 22, and the feature-channel values input to the global pooling layer of each SENet structure are specified as 16, 32, 64, 128, 256, 512, 1024, 256, 512 and 256, i.e. the numbers of feature channels output by the layers at which the structure is embedded, yielding the YOLOv3-tiny-SE network model;
S47, the network depth of YOLOv3-tiny is originally 24 layers and becomes 35 layers after the SENet structures are embedded; the main purpose of embedding the SENet network is to strengthen useful information and compress useless information. Taking the second pooling layer as a concrete example of an embedded SENet structure: the feature map output by this pooling layer is 208×208×16, which is also the input feature map size of the global pooling layer (Global Pooling); a 1×1×16 feature map is obtained after global average pooling, a 1×1×1 feature map is then obtained after dimensionality reduction through the first fully connected layer (Fully Connected), a 1×1×16 feature map is obtained after dimensionality increase through the second fully connected layer, and 1×1×16 weight values are finally obtained after activation by the Sigmoid function; multiplying these weights with the 208×208×16 input feature map gives the final output, as in the sketch below.
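The SE computation that S45 and S47 walk through (global average pooling, two fully connected layers that first reduce and then restore the channel dimension, Sigmoid weighting, channel-wise multiplication) can be sketched in PyTorch as follows; the reduction ratio of 16 is an assumption, chosen here so that the 16-channel example above reduces to the single-neuron 1×1×1 intermediate map described.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight feature channels by learned importance."""

    def __init__(self, channels: int, reduction: int = 16):  # ratio is an assumption
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # global average pooling -> c x 1 x 1
        self.excite = nn.Sequential(
            nn.Linear(channels, max(channels // reduction, 1)),  # reduce dimension
            nn.ReLU(inplace=True),
            nn.Linear(max(channels // reduction, 1), channels),  # restore dimension
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # multiply the weights into the input feature map

# The 208x208x16 output of the second pooling layer from the example in S47:
y = SEBlock(16)(torch.randn(1, 16, 208, 208))
print(y.shape)  # torch.Size([1, 16, 208, 208])
```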
TABLE 1
S5, training the YOLOv3-tiny-SE network model on the training set, specifically comprising the following steps:
S51, clustering the real target boxes of the targets to be recognized marked in the training set, obtaining initial candidate boxes for the targets predicted in the training set with the area intersection-over-union (IOU) as the evaluation index, and inputting the initial candidate boxes as initial parameters into the YOLOv3-tiny-SE network model, specifically as follows:
The real target boxes of the training data set are clustered with the K-means method according to the distance formula d(box, centroid) = 1 − IOU(box, centroid), where IOU(box, centroid) is the area intersection-over-union of the predicted target box and the real target box; with IOU(box, centroid) as the evaluation standard, a predicted candidate box whose value is not less than 0.5 is used as an initial target box;
The area intersection-over-union IOU(box, centroid) is given by the following formula:

$$\mathrm{IOU}(box, centroid) = \frac{box_{pred} \cap box_{truth}}{box_{pred} \cup box_{truth}}$$

where $box_{pred}$ and $box_{truth}$ denote the areas of the predicted target box and the real target box respectively; the ratio of their intersection to their union is the intersection-over-union of the real target box and the predicted initial candidate box. A clustering sketch follows;
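A minimal NumPy sketch of the anchor clustering with d(box, centroid) = 1 − IOU(box, centroid); since only box widths and heights are clustered, boxes are compared as if they shared a common corner, and the choice of k = 6 (the anchor count YOLOv3-tiny uses) and the termination criterion are assumptions.

```python
import numpy as np

def iou_wh(boxes: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """IOU between (n, 2) width/height boxes and (k, 2) centroids, corners aligned."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None] +
             (centroids[:, 0] * centroids[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes: np.ndarray, k: int = 6, iters: int = 100) -> np.ndarray:
    """Cluster labeled box sizes into k anchors using the distance d = 1 - IOU."""
    rng = np.random.default_rng(0)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Assign each box to the centroid with the smallest 1 - IOU distance.
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```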
S53, calling the initial weights of the YOLOv3-tiny network and setting the hyper-parameters: the learning rate, the number of iteration steps N and the batch_size (the hyper-parameters can be adjusted according to the obtained model data); then inputting the training data set into the YOLOv3-tiny-SE network model for training, stopping when the loss value it outputs is smaller than a certain threshold Q1 or the preset maximum number of iterations N is reached, to obtain a well-trained YOLOv3-tiny-SE network model; specifically as follows:
Calling the initial network weights of YOLOv3-tiny, inputting the training data set into the YOLOv3-tiny network for training, outputting the loss function value, and continuously training and adjusting the network weights and bias values according to that value until the loss function value output on the training set is smaller than the threshold Q1 or the maximum number of iterations N is reached, at which point training stops and the trained YOLOv3-tiny-SE network model is obtained;
The loss function loss(object) is expressed by the following formula:

$$
\begin{aligned}
loss(object) ={} & \lambda_{coord}\sum_{i=0}^{K^{2}}\sum_{j=0}^{M} I_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
& + \lambda_{coord}\sum_{i=0}^{K^{2}}\sum_{j=0}^{M} I_{ij}^{obj}\left[(w_i-\hat{w}_i)^2+(h_i-\hat{h}_i)^2\right] \\
& - \sum_{i=0}^{K^{2}}\sum_{j=0}^{M} I_{ij}^{obj}\left[\hat{C}_i\log C_i+(1-\hat{C}_i)\log(1-C_i)\right] \\
& - \lambda_{noobj}\sum_{i=0}^{K^{2}}\sum_{j=0}^{M} I_{ij}^{noobj}\left[\hat{C}_i\log C_i+(1-\hat{C}_i)\log(1-C_i)\right] \\
& - \sum_{i=0}^{K^{2}} I_{i}^{obj}\sum_{c\in classes}\left[\hat{p}_i(c)\log p_i(c)+(1-\hat{p}_i(c))\log(1-p_i(c))\right]
\end{aligned}
$$

The terms of the loss function correspond, in order, to the loss of the predicted center coordinates, the loss of the predicted bounding box, the loss of the predicted confidence and the loss of the predicted category. The losses of the predicted center coordinates and bounding boxes are expressed as sums of squared errors, and the losses of the predicted categories and confidences are expressed as cross-entropy loss functions.

In the above formula, $\lambda_{coord}$ is the error coefficient of the predicted coordinates; $\lambda_{noobj}$ is the confidence error coefficient when no object is contained; $K^{2}$ is the number of grid cells into which the input image is divided; $M$ is the number of target boxes predicted by each grid cell; $x_i, y_i, w_i, h_i$ are the predicted horizontal and vertical coordinates of the target center point and its width and height, and $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$ the corresponding values of the real target; $I_{ij}^{obj}$ indicates that the $j$th candidate box in the $i$th grid cell is responsible for detecting the object, and $I_{ij}^{noobj}$ that it is not; $C_i$ and $\hat{C}_i$ are the predicted and real confidences of the target to be detected in the $i$th grid cell; $p_i(c)$ and $\hat{p}_i(c)$ are the predicted and real probability values that the target identified in the $i$th grid cell belongs to a certain category $c$.
The activation function of YOLOv3 after the convolutional layers is the Leaky-ReLU function, whose expression is:

$$f(x)=\begin{cases}x, & x>0\\ ax, & x\le 0\end{cases}$$

The Leaky-ReLU function evolved from the ReLU function. The values the ReLU function yields for x ≤ 0 are all 0, so during training the problem may arise that the affected neuron weights cannot be updated. This has little influence on a deep neural network, but a large influence on a neural network with few layers. Leaky-ReLU therefore changes ReLU's zero output in the negative domain into a linear function with a small slope, preserving the output of the negative domain; however, the parameter a is a fixed value that must be determined by artificial prior knowledge and repeated training, and noise robustness in the inactive state cannot be guaranteed. To address these problems, this embodiment proposes an improved activation function, the PSReLU (Parametric Softplus-ReLU) function shown in fig. 2, whose expression is:
in the positive value domain, the YOLOv3-tiny adopts an activation function, leak-ReLU, which is the same as a ReLU function, in the negative value domain, a Softplus function is adopted, log2 units are shifted downwards, and a parameter α is taken as a learnable parameter in the network, and back propagation training is carried out in the network to be jointly optimized with other network layers.
S6, testing the performance of the YOLOv3-tiny-SE network model on the test set; specifically, step S6 comprises the following steps:
S61, loading the trained network weights, inputting the test set into the trained network, and obtaining multi-scale feature maps through the convolutional layers, pooling layers, SENet structures and upsampling layer;
S62, activating the x and y coordinates, confidence and category probability predicted by the network with a logistic function, and obtaining the coordinates, confidence and category probability of all prediction boxes through threshold judgment;
S63, removing redundant detection boxes from the result through non-maximum suppression (NMS) to generate the final target detection boxes and recognition results, as in the sketch below;
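A minimal NumPy sketch of the non-maximum suppression in S63; the 0.45 IOU threshold is an assumed typical value, not one specified by the embodiment.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.45) -> list[int]:
    """Keep the highest-scoring boxes, dropping overlaps above iou_thresh.

    boxes: (n, 4) as [x1, y1, x2, y2]; scores: (n,) confidences."""
    order = scores.argsort()[::-1]  # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the current best box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        # Discard boxes that overlap the kept box too much.
        order = order[1:][iou <= iou_thresh]
    return keep
```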
S64, comparing the effect of the native YOLOv3-tiny model with the original activation function against the improved YOLOv3-tiny-SE network model: performance tests are run on the native YOLOv3-tiny model with the improved activation function and with the original activation function respectively, and on the YOLOv3-tiny-SE network model with the improved activation function and with the original activation function respectively;
S65, inputting the test set obtained in step S3 into each of the networks of step S64 for performance detection, and obtaining the final evaluation indexes of model performance, including the mean Average Precision (mAP), the number of frames detected per second (FPS) and the Recall rate.
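Of the evaluation indexes named in S65, recall and FPS can be computed directly, as the sketch below shows (mAP is usually taken from a standard evaluation toolkit, since its interpolation details vary); the function names are illustrative.

```python
import time

def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN): the fraction of labeled targets that were detected."""
    return true_positives / (true_positives + false_negatives)

def measure_fps(detect, images) -> float:
    """Average detection frames per second of a detector callable over images."""
    start = time.perf_counter()
    for img in images:
        detect(img)
    return len(images) / (time.perf_counter() - start)
```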
S7, comparing the performance test result of the YOLOv3-tiny-SE network model on the test set obtained in step S6 with the performance of YOLOv3-tiny to obtain the performance comparison result.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Various modifications and improvements of the technical solution of the present invention may be made by those skilled in the art without departing from the spirit of the present invention, and the technical solution of the present invention is to be covered by the protection scope defined by the claims.