CN112364974B - YOLOv3 algorithm based on activation function improvement - Google Patents

YOLOv3 algorithm based on activation function improvement

Info

Publication number
CN112364974B
Authority
CN
China
Prior art keywords
frame
target
convolution
feature
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010880785.1A
Other languages
Chinese (zh)
Other versions
CN112364974A (en)
Inventor
王兰美
朱衍波
褚安亮
廖桂生
王桂宝
孙长征
贾建科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Shaanxi University of Technology
Original Assignee
Xidian University
Shaanxi University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University, Shaanxi University of Technology filed Critical Xidian University
Priority to CN202010880785.1A priority Critical patent/CN112364974B/en
Publication of CN112364974A publication Critical patent/CN112364974A/en
Application granted granted Critical
Publication of CN112364974B publication Critical patent/CN112364974B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an improved YOLOv3 algorithm based on an activation function that raises the average detection accuracy. The activation function introduces a nonlinear characteristic into the network, thereby preserving the learning ability of the network. Firstly, the PASCAL VOC data set, a data set in general use in the current target detection field, is prepared; secondly, the existing YOLOv3 network model is reconstructed, the Adam algorithm is adopted as the optimization algorithm in the training process, and the performance of the model is tested; then the improved activation function is embedded into the YOLOv3 algorithm model for training and performance evaluation; finally, a comparison with the classical YOLOv3 algorithm is made and the test results are analyzed. Compared with the classical YOLOv3 algorithm, the improved YOLOv3 algorithm based on the activation function provided by the invention raises the mAP index by nearly 1% on the general PASCAL VOC data set; in addition, the module does not introduce additional computation, so the real-time performance is unaffected compared with the original model. The module can be embedded into other classical algorithm models for comparison tests, and therefore has wide applicability and good robustness.

Description

YOLOv3 algorithm based on activation function improvement
Technical Field
The invention belongs to the field of image recognition, and particularly relates to a YOLOv3 target detection algorithm based on an improved activation function, wherein the algorithm shows good detection performance on a general standard data set PASCAL VOC.
Background
Target detection mainly comprises traditional target detection techniques and deep-learning-based target detection techniques. In recent years, with the development of technology and the spread of intelligent applications, traditional target detection techniques have long been unable to meet people's needs, so deep-learning-based target detection techniques emerged, developed rapidly, and have become the mainstream algorithms in the current target detection field.
Deep-learning-based target detection techniques can be broadly divided into two types of methods, two-stage and one-stage. The two-stage methods are mainly candidate-region-based algorithms such as R-CNN, Fast R-CNN and Faster R-CNN: these algorithms first generate a number of candidate regions on a picture, and then classify and regress candidate frames for these regions through a convolutional neural network. These methods have the highest precision, but their detection speed is low and cannot meet real-time requirements. The one-stage methods use a convolutional neural network to directly predict the classes and positions of different targets; they belong to end-to-end methods and mainly include the SSD and YOLO series.
The activation function (Activation Function) is a function running on the neurons of an artificial neural network and is responsible for mapping the inputs of a neuron to its output. Common activation functions are the Sigmoid function, the Tanh function and the ReLU function. The invention provides an improved activation function and embeds it into the existing classical YOLOv3 algorithm to test its performance: the improved model gains nearly 1% in precision over the original model, the module does not introduce additional computation, and the real-time performance is unaffected compared with the original model. The module can be embedded into other classical algorithm models for comparison tests, and therefore has wide applicability.
Disclosure of Invention
The method of the invention provides an improved YOLOv3 algorithm based on an activation function; the detection performance of the YOLOv3 algorithm is improved by embedding the improved activation function.
Step one: downloading a PASCAL VOC data set of a current target detection field general data set, ensuring to keep consistent with the field general data set so as to achieve a comparison effect, and detecting the performance of the method. The download address is: https:// pjreddie.com/projects/pascal-voc-dataset-mirror/.
The PASCAL VOC data set provides 20 object categories. Each picture in the data set used in the invention is annotated with the category information p_i of the target and the center position coordinates (x, y), width w and height h of the target, visualized by rectangular frames.
Step two: based on the improved activation function, the YOLOv3 network structure is reconstructed.
First, the initial weights of the network are randomized to follow a gaussian normal distribution, and then an RGB picture is input, which can be represented as a matrix form of a×a×3, where a is the width and height of the picture.
The input matrix then passes through the network structure built below, which consists of 52 convolutional layers and is divided into three stages, i.e., three outputs at different scales. In the following, the symbol "×" denotes the product of the dimensions:
Through the 1st convolution layer, with a 3×3 convolution kernel, a step length of 2 and 32 filters, a 208×208×32 feature map output is obtained; the input then enters the 2nd convolution layer, with a 3×3 convolution kernel, a step length of 1 and 32 filters, to obtain a 208×208×32 feature map output, and so on. Following the different convolution kernels of each layer in the network diagram, the data passes through three different stages to obtain a 52×52×256 feature map, a 26×26×512 feature map and a 13×13×1024 feature map in turn, and these then enter feature fusion layers 1, 2 and 3 respectively for the following feature fusion operations:
the feature fusion layer 1 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 52×52×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that the feature map of 52×52×75 is finally obtained.
The feature fusion layer 2 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 26×26×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that a feature map of 26×26×75 is finally obtained.
The feature fusion layer 3 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 13×13×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that a feature map of 13×13×75 is finally obtained.
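For concreteness, the sketch below lays out feature fusion layer 3 (the 13×13 branch) as a plain stack of convolutions, assuming PyTorch as the framework (the patent does not name one) and leaving out batch normalization and the activation function, both described below; the channel counts follow the description above, and 75 corresponds to 3 × (4 + 1 + 20) for the 20 PASCAL VOC classes.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of feature fusion layer 3 (13x13 branch): five convolutions with
# kernel size / filter count 1x1x128, 3x3x256, 1x1x128, 3x3x256, 1x1x128 (stride 1),
# followed by the 3x3x75 and 1x1x75 prediction convolutions. BN and activation omitted.
fusion_layer_3 = nn.Sequential(
    nn.Conv2d(1024, 128, kernel_size=1, stride=1),
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
    nn.Conv2d(256, 128, kernel_size=1, stride=1),
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
    nn.Conv2d(256, 128, kernel_size=1, stride=1),
    nn.Conv2d(128, 75, kernel_size=3, stride=1, padding=1),  # 3x3x75
    nn.Conv2d(75, 75, kernel_size=1, stride=1),              # 1x1x75
)

x = torch.randn(1, 1024, 13, 13)   # 13x13x1024 feature map from the backbone
print(fusion_layer_3(x).shape)     # torch.Size([1, 75, 13, 13]) -> 13x13x75
```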
Wherein each convolution layer contains 3 operations:
the first step: convolving the feature map matrix input to the layer
And a second step of: carrying out batch normalization processing on the convolution result obtained in the last step, normalizing all data to be between 0 and 1 to obtain a normalized two-dimensional matrix, and being beneficial to accelerating training speed;
and a third step of: and taking the normalized two-dimensional matrix obtained in the last step as the input of an activation function to obtain the final output y of the layer.
The formula for the activation function is as follows:
y = x × tanh(ln(1 + e^x))
where x is the normalized two-dimensional matrix obtained in the previous step, tanh() is the hyperbolic tangent function, and y is the value computed for each neuron after the activation function. Introducing this nonlinear activation function into the network ensures that the mapping between input and output is nonlinear rather than a simple linear combination, which preserves the learning capability of the network.
The output of the feature extraction module is three feature matrices, the dimensions of the three feature matrices are 52×52×75, 26×26×75 and 13×13×75 respectively, wherein the receptive field of each neuron in the feature matrix of 52×52×75 is minimum, and can be responsible for detecting small targets in the original input image, and similarly, the receptive field of each neuron in the feature matrix of 13×13×75 is maximum, and can be responsible for detecting large targets in the original input image. Thus, the multi-scale prediction is carried out, and the condition of missing detection of a small target can be avoided.
Taking the 13×13×75 feature map as an example, the first dimension 13 represents the number of horizontal pixels in the picture, the second dimension 13 represents the number of vertical pixels in the picture, and the third dimension 75 represents the number of features of the target of interest: it contains 3 scales of information, and each scale contains 25 information points. The 25 information points are the 4 coordinate parameters of the prediction frame t_xi, t_yi, t_wi, t_hi, the prediction confidence Ĉ_i, and the class probabilities p̂_i (a multidimensional vector with one entry per category, 20 for PASCAL VOC). Here (t_xi, t_yi) are the coordinate parameter values of the center point of the i-th prediction frame, (t_wi, t_hi) are the parameter values for the width and height of the i-th prediction frame, the prediction confidence Ĉ_i represents the probability that the i-th prediction frame contains a target, and the class probability p̂_i represents the probability that the target of the i-th prediction frame belongs to a certain category. Note that t_xi, t_yi, t_wi, t_hi are relative position coordinates, which need to be converted into the actual coordinates in the original picture. The conversion formulas are as follows:

b_x = σ(t_xi) + c_x
b_y = σ(t_yi) + c_y
b_w = p_w × e^(t_wi)
b_h = p_h × e^(t_hi)

where t_xi, t_yi, t_wi, t_hi are the predicted relative coordinate values, p_w and p_h are the width and height of the anchor frame corresponding to the prediction frame, c_x and c_y are the offsets of the prediction frame relative to the top-left corner coordinates of the picture, σ(·) is the sigmoid function, (b_x, b_y) are the actual coordinates of the center point of the prediction frame, and (b_w, b_h) are the actual width and height of the prediction frame.
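The conversion above matches the standard YOLOv3 box decoding; the sketch below illustrates it for a single prediction. The sigmoid on the center offsets and the function and variable names are assumptions consistent with the definitions above, not code taken from the patent.

```python
import math

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Convert relative predictions (t_x, t_y, t_w, t_h) into actual box parameters.

    c_x, c_y: offset of the cell containing the prediction from the top-left of the picture
    p_w, p_h: width and height of the anchor frame matched to this prediction
    Returns (b_x, b_y, b_w, b_h): actual center coordinates, width and height.
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    b_x = sigmoid(t_x) + c_x        # actual center x
    b_y = sigmoid(t_y) + c_y        # actual center y
    b_w = p_w * math.exp(t_w)       # actual width
    b_h = p_h * math.exp(t_h)       # actual height
    return b_x, b_y, b_w, b_h

# Example: a prediction in grid cell (6, 4) of the 13x13 map with a 3x5 anchor (grid units).
print(decode_box(0.2, -0.1, 0.3, 0.1, c_x=6, c_y=4, p_w=3.0, p_h=5.0))
```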
Step three: Training the improved YOLOv3 model in the PASCAL VOC data set;
(1) The network randomly initializes the weights, and the initialized values follow a Gaussian normal distribution.
(2) The input data is propagated forward through the network structure of step two of the invention to obtain the output values, namely feature map 1, feature map 2 and feature map 3, and the information of the prediction frames is obtained from the feature map information;
(3) Matching the real frames annotated in the data set with the anchor frames obtained by clustering: the center point of each real frame is calculated, the anchor frames corresponding to that center point are screened out, the anchor frame with the largest IoU with the real frame is selected as the target frame, the coordinate values of the real frame are assigned to the target frame to obtain its coordinate values, the class value and confidence value of the target frame are set to 1, and the parameter values of the remaining unmatched anchor frames are set to 0 (a sketch of the IoU computation is given after step four below).
(4) The error loss between the output values of the network prediction frames and the target values of the target frames is calculated using the loss function (an illustrative sketch of this computation is given after step (5) below).
The total loss function is:
loss=center_loss+size_loss+confidence_loss+cls_loss
center_loss=x_loss+y_loss
confidence_loss=obj_loss+noobj_loss
where N represents the total number of network prediction frames; l_i^obj indicates whether there is a target in the i-th prediction frame, with l_i^obj = 1 if a target is present and 0 otherwise; (x_i, y_i) is the true center position of the i-th annotation frame in which the target is located and (x̂_i, ŷ_i) is the center position of the i-th prediction frame; (w_i, h_i) are the true width and height of the i-th annotation frame and (ŵ_i, ĥ_i) are the width and height of the i-th prediction frame; α is used to adjust the proportion of the scale loss in the total loss; C_i is the true confidence of the i-th annotation frame and Ĉ_i is the confidence of the i-th prediction frame; p_i is the class probability of the object in the i-th annotation frame and p̂_i is the class probability of the object in the i-th prediction frame.
(5) The weights are updated using the Adam optimization algorithm until the number of iterations exceeds the set number of epochs, and training ends. The main test index of the method is mAP (mean Average Precision): the average precision AP (Average Precision) is first computed within each category, and the mAP is then obtained by averaging the APs over all categories.
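The per-term expressions of the loss are not reproduced in the text above, so the following is only an illustrative sketch under the assumption of squared-error terms, following the decomposition loss = center_loss + size_loss + confidence_loss + cls_loss and the symbol definitions given earlier; the dictionary-based data layout and function name are hypothetical.

```python
def total_loss(preds, targets, alpha=1.0):
    """Sketch of loss = center_loss + size_loss + confidence_loss + cls_loss.

    Squared-error terms are assumed here for illustration. Each element of `preds`
    and `targets` is a dict with keys x, y, w, h, conf, cls (a list of class
    probabilities); targets[i]["obj"] plays the role of l_i^obj (1 if a target is
    assigned to the i-th prediction frame, else 0), and alpha scales the size loss.
    """
    center_loss = size_loss = obj_loss = noobj_loss = cls_loss = 0.0
    for p, t in zip(preds, targets):
        if t["obj"]:  # l_i^obj = 1
            center_loss += (t["x"] - p["x"]) ** 2 + (t["y"] - p["y"]) ** 2
            size_loss += alpha * ((t["w"] - p["w"]) ** 2 + (t["h"] - p["h"]) ** 2)
            obj_loss += (t["conf"] - p["conf"]) ** 2
            cls_loss += sum((tc - pc) ** 2 for tc, pc in zip(t["cls"], p["cls"]))
        else:         # l_i^obj = 0: only the no-object confidence term contributes
            noobj_loss += (t["conf"] - p["conf"]) ** 2
    confidence_loss = obj_loss + noobj_loss
    return center_loss + size_loss + confidence_loss + cls_loss

# Tiny usage example with one matched prediction frame and two classes.
preds   = [{"x": 0.5, "y": 0.4, "w": 0.2,  "h": 0.3, "conf": 0.8, "cls": [0.9, 0.1]}]
targets = [{"x": 0.6, "y": 0.4, "w": 0.25, "h": 0.3, "conf": 1.0, "cls": [1.0, 0.0], "obj": 1}]
print(total_loss(preds, targets))
```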
Step four: Comparing with the classical YOLOv3 algorithm and analyzing the test results.
During testing, the detection accuracy at IoU = 0.5 is adopted as the measurement index of the algorithm's performance: if the intersection-over-union between the predicted rectangular frame for a picture and the real rectangular frame of that picture is greater than 0.5, the algorithm is considered to have detected the picture successfully.
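Both the anchor matching in step three and the detection-success criterion above rely on the intersection-over-union; a minimal sketch follows. The corner-based box form and helper names are assumptions for illustration, with a converter from the (center x, center y, width, height) representation used in the text.

```python
def to_corners(x, y, w, h):
    """Convert a (center x, center y, width, height) box, as used above, to corner form."""
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x_min, y_min, x_max, y_max) form."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Detection-success criterion used in testing: IoU between the predicted and real frames > 0.5.
pred  = to_corners(35, 35, 50, 50)
truth = to_corners(45, 40, 50, 50)
print(iou(pred, truth), iou(pred, truth) > 0.5)   # 0.5625 True -> counted as a success
```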
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will briefly explain the drawings needed in the embodiments or the prior art, and it is obvious that the drawings in the following description are only some embodiments of the present invention and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the structure of a YOLOv3 network model;
FIG. 3 is a schematic diagram of a network training process;
FIG. 4 is a diagram of an activation function;
FIG. 5 is a graph of partial detection results of the original YOLOv3 model;
FIG. 6 is a graph comparing the results of the partial detection of original YOLOv3 with the model of the present invention;
FIG. 7 shows the overall performance of the original YOLOv3 and the model of the invention on a validation data set.
Detailed Description
To make the above and other objects, features and advantages of the present invention more apparent, the following specific examples of the present invention are given together with the accompanying drawings, which are described in detail as follows:
referring to fig. 1, the implementation steps of the present invention are as follows:
step one: downloading a PASCAL VOC data set of a current target detection field general data set, ensuring to keep consistent with the field general data set so as to achieve a comparison effect, and detecting the performance of the method. The download address is: https:// pjreddie.com/projects/pascal-voc-dataset-mirror/.
The PASCAL VOC data set provides 20 object categories. Each picture in the data set used in the invention is annotated with the category information p_i of the target and the center position coordinates (x, y), width w and height h of the target, visualized by rectangular frames.
Step two: based on the improved activation function, the YOLOv3 network structure is reconstructed.
First, the initial weights of the network are randomized to follow a gaussian normal distribution, and then an RGB picture is input, which can be represented as a matrix form of a×a×3, where a is the width and height of the picture.
Then, as shown in fig. 2, the input matrix passes through the network structure constructed below, which consists of 52 convolution layers and is divided into three stages, i.e., three outputs at different scales. In the following, the symbol "×" denotes the product of the dimensions:
Through the 1st convolution layer, with a 3×3 convolution kernel, a step length of 2 and 32 filters, a 208×208×32 feature map output is obtained; the input then enters the 2nd convolution layer, with a 3×3 convolution kernel, a step length of 1 and 32 filters, to obtain a 208×208×32 feature map output, and so on. Following the different convolution kernels of each layer in the network diagram, the data passes through three different stages to obtain a 52×52×256 feature map, a 26×26×512 feature map and a 13×13×1024 feature map in turn, and these then enter feature fusion layers 1, 2 and 3 respectively for the following feature fusion operations:
the feature fusion layer 1 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 52×52×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that the feature map of 52×52×75 is finally obtained.
The feature fusion layer 2 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 26×26×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that a feature map of 26×26×75 is finally obtained.
The feature fusion layer 3 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 13×13×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that a feature map of 13×13×75 is finally obtained.
Wherein each convolution layer contains 3 operations:
the first step: convolving the feature map matrix input to the layer
And a second step of: carrying out batch normalization processing on the convolution result obtained in the last step, normalizing all data to be between 0 and 1 to obtain a normalized two-dimensional matrix, and being beneficial to accelerating training speed;
and a third step of: and taking the normalized two-dimensional matrix obtained in the last step as the input of an activation function to obtain the final output of the layer.
As shown in fig. 4, the formula for the activation function is as follows:
y = x × tanh(ln(1 + e^x))
where x is the normalized two-dimensional matrix obtained in the previous step, tanh() is the hyperbolic tangent function, and y is the value computed for each neuron after the activation function. Introducing this nonlinear activation function into the network ensures that the mapping between input and output is nonlinear rather than a simple linear combination, which preserves the learning capability of the network.
The output of the feature extraction module is three feature matrices, the dimensions of the three feature matrices are 52×52×75, 26×26×75 and 13×13×75 respectively, wherein the receptive field of each neuron in the feature matrix of 52×52×75 is minimum, and can be responsible for detecting small targets in the original input image, and similarly, the receptive field of each neuron in the feature matrix of 13×13×75 is maximum, and can be responsible for detecting large targets in the original input image. Thus, the multi-scale prediction is carried out, and the condition of missing detection of a small target can be avoided.
Taking the 13×13×75 feature map as an example, the first dimension 13 represents the number of horizontal pixels in the picture, the second dimension 13 represents the number of vertical pixels in the picture, and the third dimension 75 represents the number of features of the object of interest: it contains 3 scales of information, and each scale contains 25 information points. The 25 information points are the 4 coordinate parameters of the prediction frame t_xi, t_yi, t_wi, t_hi, the prediction confidence Ĉ_i, and the class probabilities p̂_i (a multidimensional vector with one entry per category, 20 for PASCAL VOC). Here (t_xi, t_yi) are the coordinate parameter values of the center point of the i-th prediction frame, (t_wi, t_hi) are the parameter values for the width and height of the i-th prediction frame, the prediction confidence Ĉ_i represents the probability that the i-th prediction frame contains a target, and the class probability p̂_i represents the probability that the target of the i-th prediction frame belongs to a certain category. Note that t_xi, t_yi, t_wi, t_hi are relative position coordinates, which need to be converted into the actual coordinates in the original picture. The conversion formulas are as follows:

b_x = σ(t_xi) + c_x
b_y = σ(t_yi) + c_y
b_w = p_w × e^(t_wi)
b_h = p_h × e^(t_hi)

where t_xi, t_yi, t_wi, t_hi are the predicted relative coordinate values, p_w and p_h are the width and height of the anchor frame corresponding to the prediction frame, c_x and c_y are the offsets of the prediction frame relative to the top-left corner coordinates of the picture, σ(·) is the sigmoid function, (b_x, b_y) are the actual coordinates of the center point of the prediction frame, and (b_w, b_h) are the actual width and height of the prediction frame.
Step three: Training the model in the PASCAL VOC data set, as shown in fig. 3;
(1) The network randomly initializes the weights, and the initialized values follow a Gaussian normal distribution.
(2) The input data is propagated forward through the network structure of step two of the invention to obtain the output values, namely feature map 1, feature map 2 and feature map 3, and the information of the prediction frames is obtained from the feature map information;
(3) Matching the real frames annotated in the data set with the anchor frames obtained by clustering: the center point of each real frame is calculated, the anchor frames corresponding to that center point are screened out, the anchor frame with the largest IoU with the real frame is selected as the target frame, the coordinate values of the real frame are assigned to the target frame to obtain its coordinate values, the class value and confidence value of the target frame are set to 1, and the parameter values of the remaining unmatched anchor frames are set to 0.
(4) The error loss between the output values of the network prediction frames and the target values of the target frames is calculated using the loss function.
The total loss function is:
loss=center_loss+size_loss+confidence_loss+cls_loss
center_loss=x_loss+y_loss
confidence_loss=obj_loss+noobj_loss
where N represents the total number of network prediction frames; l_i^obj indicates whether there is a target in the i-th prediction frame, with l_i^obj = 1 if a target is present and 0 otherwise; (x_i, y_i) is the true center position of the i-th annotation frame in which the target is located and (x̂_i, ŷ_i) is the center position of the i-th prediction frame; (w_i, h_i) are the true width and height of the i-th annotation frame and (ŵ_i, ĥ_i) are the width and height of the i-th prediction frame; α is used to adjust the proportion of the scale loss in the total loss; C_i is the true confidence of the i-th annotation frame and Ĉ_i is the confidence of the i-th prediction frame; p_i is the class probability of the object in the i-th annotation frame and p̂_i is the class probability of the object in the i-th prediction frame.
(5) The weights are updated using the Adam optimization algorithm until the number of iterations exceeds the set number of epochs, and training ends. The main test index of the method is mAP (mean Average Precision): the average precision AP (Average Precision) is first computed within each category, and the mAP is then obtained by averaging the APs over all categories.
Step four: Comparing with the classical YOLOv3 algorithm and analyzing the test results.
During testing, the detection accuracy at IoU = 0.5 is adopted as the measurement index of the algorithm's performance: if the intersection-over-union between the predicted rectangular frame for a picture and the real rectangular frame of that picture is greater than 0.5, the algorithm is considered to have detected the picture successfully.
The invention is further described below in connection with simulation examples.
Simulation example:
The invention adopts the original YOLOv3 model as the comparison model, adopts the PASCAL VOC data set as the training set and test set, and gives partial detection result graphs.
Fig. 5 shows partial detection results of the original YOLOv3 model; pictures with different backgrounds, categories and sizes are selected, and it can be seen that the detection of the basic categories of objects in the pictures is good.
FIG. 6 compares partial detection results of the original YOLOv3 and the model of the present invention, where the left side shows the results of the original YOLOv3 model and the right side shows the results of the model of the invention.
From the left column, it can be seen that the detection performance of the original YOLOv3 model is not very good when the target categories are similar, the targets overlap, or the targets are densely distributed in the picture: as in fig. 6, three sheep are mistakenly detected as cattle; there are two bears, but only one is detected; and the apples are densely distributed, but only two are shown as detected.
Compared with the left column, it can be seen that, for the three cases mentioned above, after the improvement of the method of the present invention, the three sheep are successfully detected in the case of similar target classes; in the case of overlapping targets, both bears are accurately detected; and in the case of densely distributed targets, more accurate detections are given and the recall rate is maintained.
FIG. 7 shows the overall performance of the original YOLOv3 and the model of the invention on the validation data set; AP is the average precision for each class and mAP is the mean average precision over all classes. It can be seen that, compared with the original YOLOv3 model, the model of the invention achieves a higher mAP on the validation set.
The comprehensive simulation experiments show that, compared with YOLOv3, the improved model achieves better detection precision, gaining nearly 1% in precision; in addition, the module does not introduce additional computation, and the real-time performance is unaffected compared with the original model. The module can be embedded into other classical algorithm models for comparison tests, and therefore has wide applicability.
The present invention is not limited to the above-mentioned embodiments; any simple modification, equivalent change or variation of the above embodiments made by those skilled in the art according to the technical substance of the present invention remains within the scope of the present invention.

Claims (3)

1. A YOLOv3 algorithm based on an activation function improvement, comprising the steps of:
step one: downloading a PASCAL VOC data set of a current target detection field general data set, ensuring that the PASCAL VOC data set is consistent with the field general data set so as to achieve a comparison effect, and detecting the performance of the method;
step two: reconstructing the YOLOv3 network structure based on the improved activation function;
firstly, randomizing initial weights of a network to enable the initial weights to follow Gaussian normal distribution, and then inputting an RGB picture which can be expressed as a matrix form of a multiplied by 3, wherein a is the width and the height of the picture;
the input matrix then passes through the network structure built below, which consists of 52 convolutional layers and is divided into three phases, i.e., three different scale outputs; in the following, the symbol "×" denotes the product of the dimensions:
through the 1st convolution layer, with a 3×3 convolution kernel, a step length of 2 and 32 filters, a 208×208×32 feature map output is obtained; entering the 2nd convolution layer, with a 3×3 convolution kernel, a step length of 1 and 32 filters, a 208×208×32 feature map output is obtained, and so on; according to the different convolution kernels of each layer in the network diagram, the data passes through three different stages to obtain a 52×52×256 feature map, a 26×26×512 feature map and a 13×13×1024 feature map in turn, and then enters feature fusion layers 1, 2 and 3 for the following feature fusion operations respectively:
the feature fusion layer 1 is a convolution module and comprises 5 steps of convolution operations, the sizes and the number of convolution kernels are 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128 in sequence, the step sizes are 1, a feature map of 52×52×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that a feature map of 52×52×75 is finally obtained;
the feature fusion layer 2 is a convolution module and comprises 5 steps of convolution operations, the sizes and the number of convolution kernels are 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128 in sequence, the step sizes are 1, a feature map of 26×26×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that a feature map of 26×26×75 is finally obtained;
the feature fusion layer 3 is a convolution module and comprises 5 steps of convolution operations, the sizes and the number of convolution kernels are 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128 in sequence, the step sizes are 1, a feature map of 13×13×128 is obtained, then the convolution operations of 3×3×75 and 1×1×75 are carried out, and finally a feature map of 13×13×75 is obtained;
wherein each convolution layer contains 3 operations:
the first step: performing convolution operation on the feature map matrix input into the layer;
and a second step of: carrying out batch normalization processing on the convolution result obtained in the last step, normalizing all data to be between 0 and 1 to obtain a normalized two-dimensional matrix, and being beneficial to accelerating training speed;
and a third step of: taking the normalized two-dimensional matrix obtained in the last step as the input of an activation function to obtain the final output of the layer;
the formula for the activation function is as follows:
y = x × tanh(ln(1 + e^x))
wherein x is the normalized two-dimensional matrix obtained in the previous step, tanh() is the hyperbolic tangent function, and y is the value computed for each neuron after the activation function; introducing this nonlinear activation function into the network ensures that the mapping between input and output is nonlinear rather than a simple linear combination, which preserves the learning capacity of the network;
the output of the feature extraction module is three feature matrices, the dimensions of the three feature matrices are 52×52×75, 26×26×75 and 13×13×75 respectively, wherein the receptive field of each neuron in the feature matrix of 52×52×75 is minimum, and can be responsible for detecting a small target in an original input image, and similarly, the receptive field of each neuron in the feature matrix of 13×13×75 is maximum, and can be responsible for detecting a large target in the original input image; thus, the multi-scale prediction is carried out, and the condition of missing detection of a small target can be avoided;
taking the 13×13×75 feature map as an example, the first dimension 13 represents the number of horizontal pixels in the picture, the second dimension 13 represents the number of vertical pixels in the picture, and the third dimension 75 represents the number of features of the object of interest: it contains 3 scales of information, and each scale contains 25 information points; the 25 information points are the 4 coordinate parameters of the prediction frame t_xi, t_yi, t_wi, t_hi, the prediction confidence Ĉ_i, and the class probabilities p̂_i (a multidimensional vector with one entry per category), where (t_xi, t_yi) are the coordinate parameter values of the center point of the i-th prediction frame, (t_wi, t_hi) are the parameter values for the width and height of the i-th prediction frame, the prediction confidence Ĉ_i represents the probability that the i-th prediction frame contains a target, and the class probability p̂_i represents the probability that the target of the i-th prediction frame belongs to a certain category; note that t_xi, t_yi, t_wi, t_hi are relative position coordinates, which need to be converted into the actual coordinates in the original picture; the conversion formulas are as follows:

b_x = σ(t_xi) + c_x
b_y = σ(t_yi) + c_y
b_w = p_w × e^(t_wi)
b_h = p_h × e^(t_hi)

wherein t_xi, t_yi, t_wi, t_hi are the predicted relative coordinate values, p_w and p_h are the width and height of the anchor frame corresponding to the prediction frame, c_x and c_y are the offsets of the prediction frame relative to the top-left corner coordinates of the picture, σ(·) is the sigmoid function, (b_x, b_y) are the actual coordinates of the center point of the prediction frame, and (b_w, b_h) are the actual width and height of the prediction frame;
step three: training the model in a PASCAL VOC data set;
(1) Randomly initializing a weight by a network to enable the initialized weight to be subjected to Gaussian normal distribution;
(2) The input data is transmitted forwards through the network structure in the second step of the invention to obtain output values which are characteristic diagram 1, characteristic diagram 2 and characteristic diagram 3, and the information of the predicted frame is obtained by utilizing the information of the characteristic diagram;
(3) Matching the real frames marked in the data set with anchor frames obtained by clustering: calculating a center point of a real frame, screening an anchor frame corresponding to the center point, selecting the anchor frame with the maximum IoU value with the real frame as a target frame, giving coordinate value information of the real frame to the target frame to obtain the coordinate value of the target frame, setting the class value of the target frame as 1, setting the confidence value as 1, and setting the parameter values of the rest untagged anchor frames as 0;
(4) Solving error loss between the output value of the network prediction frame and the target value of the target frame by using a loss function;
the total loss function is:
loss=center_loss+size_loss+confidence_loss+cls_loss
center_loss=x_loss+y_loss
confidence_loss=obj_loss+noobj_loss
where N represents the total number of network prediction frames; l_i^obj indicates whether there is a target in the i-th prediction frame, with l_i^obj = 1 if a target is present and 0 otherwise; (x_i, y_i) is the true center position of the i-th annotation frame in which the target is located and (x̂_i, ŷ_i) is the center position of the i-th prediction frame; (w_i, h_i) are the true width and height of the i-th annotation frame and (ŵ_i, ĥ_i) are the width and height of the i-th prediction frame; α is used to adjust the proportion of the scale loss in the total loss; C_i is the true confidence of the i-th annotation frame and Ĉ_i is the confidence of the i-th prediction frame; p_i is the class probability of the object in the i-th annotation frame and p̂_i is the class probability of the object in the i-th prediction frame;
(5) Updating the weight by using an Adam optimization algorithm until the iteration number is greater than epoch, and ending training; the main test index of the method is mAP (mean Average Precision), which represents average accuracy, firstly, the average accuracy is AP (Average Precision) in one category, and then the average accuracy of all categories is averaged mAP (mean Average Precision);
step four: and comparing the classical YOLOv3 algorithm, and analyzing the test result.
2. The YOLOv3 algorithm based on an activation function improvement according to claim 1, wherein in step one a general data set of the current target detection field, the PASCAL VOC data set, is downloaded; the PASCAL VOC data set provides 20 object categories; each picture in the data set used in the invention is annotated with the category information p_i of the target and the central position coordinates (x, y), width w and height h of the target, visualized by rectangular frames.
3. The YOLOv3 algorithm based on an activation function improvement according to claim 1, wherein in step four the classical YOLOv3 algorithm is used for comparison and the test results are analyzed.
CN202010880785.1A 2020-08-28 2020-08-28 YOLOv3 algorithm based on activation function improvement Active CN112364974B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010880785.1A CN112364974B (en) 2020-08-28 2020-08-28 YOLOv3 algorithm based on activation function improvement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010880785.1A CN112364974B (en) 2020-08-28 2020-08-28 YOLOv3 algorithm based on activation function improvement

Publications (2)

Publication Number Publication Date
CN112364974A CN112364974A (en) 2021-02-12
CN112364974B (en) 2024-02-09

Family

ID=74516708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010880785.1A Active CN112364974B (en) 2020-08-28 2020-08-28 YOLOv3 algorithm based on activation function improvement

Country Status (1)

Country Link
CN (1) CN112364974B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949633B (en) * 2021-03-05 2022-10-21 中国科学院光电技术研究所 Improved YOLOv 3-based infrared target detection method
CN113486764B (en) * 2021-06-30 2022-05-03 中南大学 Pothole detection method based on improved YOLOv3
CN115113637A (en) * 2022-07-13 2022-09-27 中国科学院地质与地球物理研究所 Unmanned geophysical inspection system and method based on 5G and artificial intelligence

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN111062282A (en) * 2019-12-05 2020-04-24 武汉科技大学 Transformer substation pointer type instrument identification method based on improved YOLOV3 model
CN111310861A (en) * 2020-03-27 2020-06-19 西安电子科技大学 License plate recognition and positioning method based on deep neural network
CN111310773A (en) * 2020-03-27 2020-06-19 西安电子科技大学 Efficient license plate positioning method of convolutional neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
CN111062282A (en) * 2019-12-05 2020-04-24 武汉科技大学 Transformer substation pointer type instrument identification method based on improved YOLOV3 model
CN111310861A (en) * 2020-03-27 2020-06-19 西安电子科技大学 License plate recognition and positioning method based on deep neural network
CN111310773A (en) * 2020-03-27 2020-06-19 西安电子科技大学 Efficient license plate positioning method of convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
余永维; 韩鑫; 杜柳青. 基于Inception-SSD算法的零件识别 (Part recognition based on the Inception-SSD algorithm). 光学精密工程 (Optics and Precision Engineering), 2020, No. 08, full text. *

Also Published As

Publication number Publication date
CN112364974A (en) 2021-02-12

Similar Documents

Publication Publication Date Title
CN109977918B (en) Target detection positioning optimization method based on unsupervised domain adaptation
CN110136154B (en) Remote sensing image semantic segmentation method based on full convolution network and morphological processing
CN110334765B (en) Remote sensing image classification method based on attention mechanism multi-scale deep learning
CN105678284B (en) A kind of fixed bit human body behavior analysis method
CN105138973B (en) The method and apparatus of face authentication
CN103605972B (en) Non-restricted environment face verification method based on block depth neural network
CN111612017B (en) Target detection method based on information enhancement
CN113486981B (en) RGB image classification method based on multi-scale feature attention fusion network
CN112364974B (en) YOLOv3 algorithm based on activation function improvement
CN114758288B (en) Power distribution network engineering safety control detection method and device
CN112418212B (en) YOLOv3 algorithm based on EIoU improvement
CN112464865A (en) Facial expression recognition method based on pixel and geometric mixed features
CN111079683A (en) Remote sensing image cloud and snow detection method based on convolutional neural network
CN110197205A (en) A kind of image-recognizing method of multiple features source residual error network
CN110543906B (en) Automatic skin recognition method based on Mask R-CNN model
CN110287873A (en) Noncooperative target pose measuring method, system and terminal device based on deep neural network
CN107169504A (en) A kind of hand-written character recognition method based on extension Non-linear Kernel residual error network
CN103714148B (en) SAR image search method based on sparse coding classification
CN110716792B (en) Target detector and construction method and application thereof
CN111898621A (en) Outline shape recognition method
CN113469088A (en) SAR image ship target detection method and system in passive interference scene
CN114998220A (en) Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment
CN111598854A (en) Complex texture small defect segmentation method based on rich robust convolution characteristic model
Lin et al. Determination of the varieties of rice kernels based on machine vision and deep learning technology
CN114048468A (en) Intrusion detection method, intrusion detection model training method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant