CN112364974B - YOLOv3 algorithm based on activation function improvement - Google Patents
YOLOv3 algorithm based on activation function improvement
- Publication number
- CN112364974B (application CN202010880785.1A)
- Authority
- CN
- China
- Prior art keywords
- frame
- target
- convolution
- feature
- loss
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000004422 calculation algorithm Methods 0.000 title claims abstract description 35
- 230000004913 activation Effects 0.000 title claims abstract description 34
- 230000006872 improvement Effects 0.000 title claims description 6
- 230000006870 function Effects 0.000 claims abstract description 46
- 238000001514 detection method Methods 0.000 claims abstract description 30
- 238000000034 method Methods 0.000 claims abstract description 23
- 238000012360 testing method Methods 0.000 claims abstract description 14
- 238000012549 training Methods 0.000 claims abstract description 13
- 238000005457 optimization Methods 0.000 claims abstract description 4
- 238000010586 diagram Methods 0.000 claims description 28
- 239000011159 matrix material Substances 0.000 claims description 24
- 230000004927 fusion Effects 0.000 claims description 18
- 210000002569 neuron Anatomy 0.000 claims description 11
- 238000013507 mapping Methods 0.000 claims description 7
- 230000000694 effects Effects 0.000 claims description 6
- 230000009286 beneficial effect Effects 0.000 claims description 3
- 238000006243 chemical reaction Methods 0.000 claims description 3
- 238000000605 extraction Methods 0.000 claims description 3
- 238000010606 normalization Methods 0.000 claims description 3
- 238000012545 processing Methods 0.000 claims description 3
- 238000012216 screening Methods 0.000 claims description 3
- 239000013598 vector Substances 0.000 claims description 3
- 230000008569 process Effects 0.000 abstract description 4
- 238000004364 calculation method Methods 0.000 abstract description 3
- 238000005516 engineering process Methods 0.000 description 5
- 238000013527 convolutional neural network Methods 0.000 description 4
- 238000013135 deep learning Methods 0.000 description 3
- 238000004088 simulation Methods 0.000 description 3
- 238000013528 artificial neural network Methods 0.000 description 2
- 238000005259 measurement Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000010200 validation analysis Methods 0.000 description 2
- 241000283690 Bos taurus Species 0.000 description 1
- 244000141359 Malus pumila Species 0.000 description 1
- 241001494479 Pecora Species 0.000 description 1
- 235000021016 apples Nutrition 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention provides an improved YOLOv3 algorithm based on an activation function, which raises the average detection accuracy. The activation function introduces a nonlinear characteristic into the network, thereby preserving the learning ability of the network. Firstly, the PASCAL VOC data set, the general benchmark in the current target detection field, is prepared; secondly, the existing YOLOv3 network model is reconstructed, the Adam algorithm is adopted as the optimization algorithm during training, and the performance of the model is tested; then the improved activation function is embedded into the YOLOv3 algorithm model, which is trained and its performance evaluated; finally, the result is compared with the classical YOLOv3 algorithm and the test results are analyzed. Compared with the classical YOLOv3 algorithm, the improved YOLOv3 algorithm based on the activation function raises the mAP index by nearly 1% on the general PASCAL VOC data set; in addition, the module introduces no extra computation, so real-time performance is unaffected compared with the original model. The module can also be embedded into other classical algorithm models for comparison tests, and therefore has good applicability and robustness.
Description
Technical Field
The invention belongs to the field of image recognition, and particularly relates to a YOLOv3 target detection algorithm based on an improved activation function; the algorithm shows good detection performance on the general standard data set PASCAL VOC.
Background
Target detection comprises traditional target detection technologies and target detection technologies based on deep learning. In recent years, with the development and popularization of intelligent technologies, traditional target detection can no longer meet people's needs; target detection based on deep learning has therefore emerged, developed rapidly, and become the mainstream approach in the current target detection field.
Target detection techniques based on deep learning can be broadly divided into two types of methods, one-stage and two-stage. The two-stage methods are mainly algorithms based on candidate regions, such as R-CNN, Fast R-CNN and Faster R-CNN; these algorithms first generate a number of candidate regions on a picture and then classify and regress the candidate boxes with a convolutional neural network. Such methods have the highest precision, but their detection speed is low and cannot meet real-time requirements. The one-stage methods use a convolutional neural network to directly predict the classes and positions of different targets; they belong to the end-to-end approach and mainly comprise the SSD and YOLO series.
An activation function (Activation Function) is a function running on the neurons of an artificial neural network that is responsible for mapping the inputs of a neuron to its output. Common activation functions are the Sigmoid function, the Tanh function and the ReLU function. The invention proposes an improved activation function and embeds it into the existing classical YOLOv3 algorithm to test its performance; the improved model gains nearly 1% in precision over the original model, the module introduces no extra computation, and real-time performance is unaffected compared with the original model. The module can also be embedded into other classical algorithm models for comparison tests, which gives it broad applicability.
Disclosure of Invention
The invention provides an improved YOLOv3 algorithm based on an activation function; by embedding the improved activation function, the detection performance of the YOLOv3 algorithm is improved.
Step one: downloading a PASCAL VOC data set of a current target detection field general data set, ensuring to keep consistent with the field general data set so as to achieve a comparison effect, and detecting the performance of the method. The download address is: https:// pjreddie.com/projects/pascal-voc-dataset-mirror/.
The PASCAL VOC data set provides 20 object categories. Each picture in the data set used in the invention is annotated with the category information p_i of the target and the center position coordinates (x, y), width w and height h of the target, visualized by rectangular boxes.
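For concreteness, a minimal sketch of reading one such annotation is given below. It assumes the standard PASCAL VOC XML layout (object name plus xmin/ymin/xmax/ymax corners); the conversion to center coordinates (x, y), width w and height h is the usual corner-to-center conversion, and the class list and file path are illustrative only.

```python
import xml.etree.ElementTree as ET

VOC_CLASSES = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car",
               "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
               "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]

def parse_voc_annotation(xml_path):
    """Return a list of (class_index, x_center, y_center, width, height) per annotated object."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        bb = obj.find("bndbox")
        xmin, ymin = float(bb.find("xmin").text), float(bb.find("ymin").text)
        xmax, ymax = float(bb.find("xmax").text), float(bb.find("ymax").text)
        # convert corner coordinates to center position (x, y) plus width w and height h
        x, y = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
        w, h = xmax - xmin, ymax - ymin
        boxes.append((VOC_CLASSES.index(name), x, y, w, h))
    return boxes
```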
Step two: based on the improved activation function, the YOLOv3 network structure is reconstructed.
First, the initial weights of the network are randomized to follow a gaussian normal distribution, and then an RGB picture is input, which can be represented as a matrix form of a×a×3, where a is the width and height of the picture.
The input matrix then passes through the network structure constructed below, which consists of 52 convolutional layers and is divided into three stages, i.e., three outputs at different scales. In the following, feature map sizes and kernel counts are written as products using "×":
through the 1st convolution layer, with a 3×3 kernel, stride 2 and 32 kernels, a 208×208×32 feature map is obtained; the input then enters the 2nd convolution layer, with a 3×3 kernel, stride 1 and 32 kernels, giving a 208×208×32 feature map, and so on. Following the different convolution kernels of each layer in the network diagram, the data passes through three different stages, yielding in turn a 52×52×256 feature map, a 26×26×512 feature map and a 13×13×1024 feature map, which then enter feature fusion layers 1, 2 and 3 for further feature fusion, respectively as follows:
The feature fusion layer 1 is a convolution module comprising 5 convolution operations; the kernel sizes and numbers are, in order, 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, all with stride 1, yielding a 52×52×128 feature map; 3×3×75 and 1×1×75 convolutions are then applied, finally giving a 52×52×75 feature map.
The feature fusion layer 2 is a convolution module comprising 5 convolution operations; the kernel sizes and numbers are, in order, 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, all with stride 1, yielding a 26×26×128 feature map; 3×3×75 and 1×1×75 convolutions are then applied, finally giving a 26×26×75 feature map.
The feature fusion layer 3 is a convolution module comprising 5 convolution operations; the kernel sizes and numbers are, in order, 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, all with stride 1, yielding a 13×13×128 feature map; 3×3×75 and 1×1×75 convolutions are then applied, finally giving a 13×13×75 feature map.
Each convolution layer comprises 3 operations:
First step: perform a convolution operation on the feature map matrix input to the layer;
Second step: apply batch normalization to the convolution result of the previous step, normalizing all data to between 0 and 1 to obtain a normalized two-dimensional matrix, which helps accelerate training;
Third step: take the normalized two-dimensional matrix obtained in the previous step as the input of the activation function to obtain the final output y of the layer.
The formula of the activation function is as follows:
y = x × tanh(ln(1 + e^x))
where x is the normalized two-dimensional matrix obtained in the previous step, tanh() is the hyperbolic tangent function, and y is the output value of each neuron after the activation function. Introducing this nonlinear activation function into the network ensures that the mapping between input and output is nonlinear rather than a simple linear combination, which preserves the learning ability of the network.
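A minimal PyTorch sketch of this building block is given below. It is not the patented implementation itself, only one way the described composition of convolution, batch normalization and the activation y = x·tanh(ln(1+e^x)) could be written (softplus(x) = ln(1+e^x)); the module names and the channel numbers of the fusion head are taken from the description above and are otherwise illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImprovedActivation(nn.Module):
    """y = x * tanh(ln(1 + e^x)), i.e. x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

class ConvBNAct(nn.Module):
    """One convolution layer as described: convolution, batch normalization, activation."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = ImprovedActivation()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

def fusion_head(in_ch, num_outputs=75):
    """The 5-step 1x1/3x3 convolution module followed by the 3x3x75 and 1x1x75 convolutions."""
    return nn.Sequential(
        ConvBNAct(in_ch, 128, 1), ConvBNAct(128, 256, 3),
        ConvBNAct(256, 128, 1), ConvBNAct(128, 256, 3),
        ConvBNAct(256, 128, 1),
        ConvBNAct(128, 75, 3),
        nn.Conv2d(75, num_outputs, 1))  # final 1x1x75 prediction convolution
```

For example, fusion_head(1024) would correspond to the branch that turns the 13×13×1024 feature map into the 13×13×75 output described above.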
The output of the feature extraction module is three feature matrices with dimensions 52×52×75, 26×26×75 and 13×13×75. Each neuron in the 52×52×75 feature matrix has the smallest receptive field and is responsible for detecting small targets in the original input image; likewise, each neuron in the 13×13×75 feature matrix has the largest receptive field and is responsible for detecting large targets in the original input image. Prediction is thus carried out at multiple scales, which avoids missed detection of small targets.
Taking the 13×13×75 feature map as an example, the first dimension 13 represents the number of horizontal grid cells of the picture, the second dimension 13 represents the number of vertical grid cells of the picture, and the third dimension 75 represents the number of features of the target of interest: it contains 3 groups of scale information, each group comprising 25 information points, namely the 4 coordinate values t_xi, t_yi, t_wi, t_hi of the prediction box, the prediction confidence, and the class probabilities. Here (t_xi, t_yi) are the coordinate parameters of the center point of the i-th prediction box, (t_wi, t_hi) are the parameters of the width and height of the i-th prediction box, the prediction confidence represents the probability that the i-th prediction box contains a target, and the class probability is a multidimensional vector representing the probability that the target in the i-th prediction box belongs to each class. Note that t_xi, t_yi, t_wi, t_hi are relative position coordinates and must be converted into the actual coordinates in the original picture. The conversion formulas are as follows:
b_xi = σ(t_xi) + c_x, b_yi = σ(t_yi) + c_y, b_wi = p_w·e^(t_wi), b_hi = p_h·e^(t_hi), where σ() is the sigmoid function, t_xi, t_yi, t_wi, t_hi are the predicted relative coordinate values, p_w, p_h represent the width and height of the anchor box corresponding to the prediction box, c_x, c_y represent the offset of the prediction box relative to the top-left corner position coordinates of the picture, (b_xi, b_yi) represent the actual coordinates of the center point of the prediction box, and (b_wi, b_hi) represent the actual width and actual height of the prediction box.
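A sketch of this decoding is given below; it assumes the standard YOLOv3 form, with a sigmoid applied to the center offsets and an exponential to the width/height terms, and the names grid_x, grid_y, anchor_w, anchor_h and stride are illustrative.

```python
import torch

def decode_boxes(t_x, t_y, t_w, t_h, grid_x, grid_y, anchor_w, anchor_h, stride):
    """Convert relative predictions (t_x, t_y, t_w, t_h) to actual picture coordinates."""
    bx = (torch.sigmoid(t_x) + grid_x) * stride  # actual center x: offset within the cell plus cell position
    by = (torch.sigmoid(t_y) + grid_y) * stride  # actual center y
    bw = anchor_w * torch.exp(t_w)               # actual width from the matched anchor width p_w
    bh = anchor_h * torch.exp(t_h)               # actual height from the matched anchor height p_h
    return bx, by, bw, bh
```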
Step three: train the improved YOLOv3 model on the PASCAL VOC data set.
(1) The network randomly initializes its weights so that the initial values follow a Gaussian normal distribution.
(2) The input data is propagated forward through the network structure of step two of the invention, producing as output feature map 1, feature map 2 and feature map 3, and the prediction box information is obtained from the feature map information;
(3) Match the real boxes annotated in the data set with the anchor boxes obtained by clustering: compute the center point of each real box, select the anchor boxes located at that center point, take the anchor box with the largest IoU value with the real box as the target box, assign the coordinate values of the real box to the target box to obtain the target box coordinates, set the class value of the target box to 1 and its confidence value to 1, and set the parameter values of the remaining unmatched anchor boxes to 0, as in the sketch below.
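The following is a simplified sketch of that assignment; the array layout, the grid-unit coordinates and the centered-IoU computation for anchors are assumptions for illustration, not the exact patented procedure.

```python
import numpy as np

def assign_targets(gt_boxes, anchors, grid_size, num_classes=20):
    """gt_boxes: list of (cls, x, y, w, h) in grid units; anchors: list of (w, h) in grid units."""
    target = np.zeros((grid_size, grid_size, len(anchors), 5 + num_classes), dtype=np.float32)
    for cls, x, y, w, h in gt_boxes:
        col, row = int(x), int(y)                  # grid cell containing the real box center
        ious = []
        for aw, ah in anchors:                     # IoU of the real box with each anchor, centers aligned
            inter = min(w, aw) * min(h, ah)
            ious.append(inter / (w * h + aw * ah - inter))
        best = int(np.argmax(ious))                # anchor with the largest IoU becomes the target box
        target[row, col, best, 0:4] = [x, y, w, h] # give the real box coordinates to the target box
        target[row, col, best, 4] = 1.0            # confidence value set to 1
        target[row, col, best, 5 + int(cls)] = 1.0 # class value set to 1
    return target
```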
(4) Compute the error loss between the output values of the network prediction boxes and the target values of the target boxes using the loss function.
The total loss function is:
loss=center_loss+size_loss+confidence_loss+cls_loss
center_loss=x_loss+y_loss
confidence_loss=obj_loss+noobj_loss
where N represents the total number of network prediction boxes; l_i^obj indicates whether there is a target in the i-th prediction box, with l_i^obj = 1 if a target is present and 0 otherwise; (x_i, y_i) denotes the true center position of the i-th annotation box where the target is located, and (x̂_i, ŷ_i) the center position of the i-th prediction box; (w_i, h_i) are the true width and height of the i-th annotation box where the target is located, and (ŵ_i, ĥ_i) the width and height of the i-th prediction box; α is used to adjust the proportion of the scale loss among all the losses; C_i denotes the true confidence of the i-th annotation box where the target is located, and Ĉ_i the confidence of the i-th prediction box; p_i denotes the class probability of the object in the i-th annotation box where the target is located, and p̂_i the class probability of the i-th prediction box object.
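The component formulas (x_loss, y_loss, size_loss, obj_loss, noobj_loss, cls_loss) appear only as figures in the original text, so the sketch below merely illustrates one common YOLOv3-style choice — squared error for the center and size terms with the size term weighted by α, and binary cross-entropy for confidence and class — and should not be read as the exact patented loss.

```python
import torch
import torch.nn.functional as F

def yolo_loss(pred, target, alpha=2.0):
    """pred/target: (..., 5 + num_classes) tensors; channel 4 is confidence, 5: are classes."""
    obj = target[..., 4]                 # l_i^obj: 1 where an anchor was assigned a target
    noobj = 1.0 - obj
    center_loss = (obj * ((pred[..., 0] - target[..., 0]) ** 2 +
                          (pred[..., 1] - target[..., 1]) ** 2)).sum()
    size_loss = alpha * (obj * ((pred[..., 2] - target[..., 2]) ** 2 +
                                (pred[..., 3] - target[..., 3]) ** 2)).sum()
    conf_bce = F.binary_cross_entropy_with_logits(pred[..., 4], obj, reduction="none")
    confidence_loss = (obj * conf_bce).sum() + (noobj * conf_bce).sum()  # obj_loss + noobj_loss
    cls_bce = F.binary_cross_entropy_with_logits(pred[..., 5:], target[..., 5:], reduction="none")
    cls_loss = (obj.unsqueeze(-1) * cls_bce).sum()
    return center_loss + size_loss + confidence_loss + cls_loss
```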
(5) Update the weights with the Adam optimization algorithm until the number of iterations exceeds the preset number of epochs, then end training. The main test index of the method is mAP (mean Average Precision): first the average precision AP (Average Precision) is computed within each category, then the APs of all categories are averaged to obtain the mAP.
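A skeleton of such a training loop is sketched below, assuming the PyTorch Adam optimizer; model, train_loader and the yolo_loss sketch above are illustrative names, not the patented training code.

```python
import torch

def train(model, train_loader, epochs=100, lr=1e-3, device="cuda"):
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # Adam optimization algorithm
    for epoch in range(epochs):                              # stop once the preset epoch count is reached
        for images, targets in train_loader:
            images = images.to(device)
            targets = [t.to(device) for t in targets]        # one target tensor per output scale
            preds = model(images)                            # feature maps 1, 2 and 3
            loss = sum(yolo_loss(p, t) for p, t in zip(preds, targets))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                                 # weight update
```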
Step four: and comparing the classical YOLOv3 algorithm, and analyzing the test result.
In the test process, the detection accuracy at IoU = 0.5 is adopted as the metric of algorithm performance: if the intersection-over-union between the predicted rectangular box of a picture and the real rectangular box of that picture is greater than 0.5, the algorithm is considered to have detected that picture successfully.
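The intersection-over-union used in this criterion can be computed as in the short sketch below (boxes in corner format are assumed).

```python
def iou(box_a, box_b):
    """box = (xmin, ymin, xmax, ymax); returns intersection-over-union of the two rectangles."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# a detection is counted as successful when iou(pred_box, true_box) > 0.5
```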
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will briefly explain the drawings needed in the embodiments or the prior art, and it is obvious that the drawings in the following description are only some embodiments of the present invention and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the structure of a YOLOv3 network model;
FIG. 3 is a schematic diagram of a network training process;
FIG. 4 is a diagram of an activation function;
FIG. 5 is a graph of partial detection results of the original YOLOv3 model;
FIG. 6 is a graph comparing the results of the partial detection of original YOLOv3 with the model of the present invention;
FIG. 7 shows the overall performance of the original YOLOv3 and the model of the invention on a validation data set;
Detailed Description
To make the above and other objects, features and advantages of the present invention more apparent, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings:
referring to fig. 1, the implementation steps of the present invention are as follows:
step one: downloading a PASCAL VOC data set of a current target detection field general data set, ensuring to keep consistent with the field general data set so as to achieve a comparison effect, and detecting the performance of the method. The download address is: https:// pjreddie.com/projects/pascal-voc-dataset-mirror/.
The PASCAL VOC data set provides 20 object categories. Each picture in the data set used in the invention is annotated with the category information p_i of the target and the center position coordinates (x, y), width w and height h of the target, visualized by rectangular boxes.
Step two: based on the improved activation function, the YOLOv3 network structure is reconstructed.
First, the initial weights of the network are randomized to follow a gaussian normal distribution, and then an RGB picture is input, which can be represented as a matrix form of a×a×3, where a is the width and height of the picture.
Then, as shown in fig. 2, the input matrix passes through the network structure constructed below, which consists of 52 convolution layers and is divided into three stages, i.e., three outputs at different scales. In the following, feature map sizes and kernel counts are written as products using "×":
through the 1st convolution layer, with a 3×3 kernel, stride 2 and 32 kernels, a 208×208×32 feature map is obtained; the input then enters the 2nd convolution layer, with a 3×3 kernel, stride 1 and 32 kernels, giving a 208×208×32 feature map, and so on. Following the different convolution kernels of each layer in the network diagram, the data passes through three different stages, yielding in turn a 52×52×256 feature map, a 26×26×512 feature map and a 13×13×1024 feature map, which then enter feature fusion layers 1, 2 and 3 for further feature fusion, respectively as follows:
The feature fusion layer 1 is a convolution module comprising 5 convolution operations; the kernel sizes and numbers are, in order, 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, all with stride 1, yielding a 52×52×128 feature map; 3×3×75 and 1×1×75 convolutions are then applied, finally giving a 52×52×75 feature map.
The feature fusion layer 2 is a convolution module comprising 5 convolution operations; the kernel sizes and numbers are, in order, 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, all with stride 1, yielding a 26×26×128 feature map; 3×3×75 and 1×1×75 convolutions are then applied, finally giving a 26×26×75 feature map.
The feature fusion layer 3 is a convolution module comprising 5 convolution operations; the kernel sizes and numbers are, in order, 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, all with stride 1, yielding a 13×13×128 feature map; 3×3×75 and 1×1×75 convolutions are then applied, finally giving a 13×13×75 feature map.
Each convolution layer comprises 3 operations:
First step: perform a convolution operation on the feature map matrix input to the layer;
Second step: apply batch normalization to the convolution result of the previous step, normalizing all data to between 0 and 1 to obtain a normalized two-dimensional matrix, which helps accelerate training;
Third step: take the normalized two-dimensional matrix obtained in the previous step as the input of the activation function to obtain the final output of the layer.
As shown in fig. 4, the formula of the activation function is as follows:
y = x × tanh(ln(1 + e^x))
where x is the normalized two-dimensional matrix obtained in the previous step, tanh() is the hyperbolic tangent function, and y is the output value of each neuron after the activation function. Introducing this nonlinear activation function into the network ensures that the mapping between input and output is nonlinear rather than a simple linear combination, which preserves the learning ability of the network.
The output of the feature extraction module is three feature matrices with dimensions 52×52×75, 26×26×75 and 13×13×75. Each neuron in the 52×52×75 feature matrix has the smallest receptive field and is responsible for detecting small targets in the original input image; likewise, each neuron in the 13×13×75 feature matrix has the largest receptive field and is responsible for detecting large targets in the original input image. Prediction is thus carried out at multiple scales, which avoids missed detection of small targets.
Taking the 13×13×75 feature map as an example, the first dimension 13 represents the number of horizontal grid cells of the picture, the second dimension 13 represents the number of vertical grid cells of the picture, and the third dimension 75 represents the number of features of the target of interest: it contains 3 groups of scale information, each group comprising 25 information points, namely the 4 coordinate values t_xi, t_yi, t_wi, t_hi of the prediction box, the prediction confidence, and the class probabilities. Here (t_xi, t_yi) are the coordinate parameters of the center point of the i-th prediction box, (t_wi, t_hi) are the parameters of the width and height of the i-th prediction box, the prediction confidence represents the probability that the i-th prediction box contains a target, and the class probability is a multidimensional vector representing the probability that the target in the i-th prediction box belongs to each class. Note that t_xi, t_yi, t_wi, t_hi are relative position coordinates and must be converted into the actual coordinates in the original picture. The conversion formulas are as follows:
b_xi = σ(t_xi) + c_x, b_yi = σ(t_yi) + c_y, b_wi = p_w·e^(t_wi), b_hi = p_h·e^(t_hi), where σ() is the sigmoid function, t_xi, t_yi, t_wi, t_hi are the predicted relative coordinate values, p_w, p_h represent the width and height of the anchor box corresponding to the prediction box, c_x, c_y represent the offset of the prediction box relative to the top-left corner position coordinates of the picture, (b_xi, b_yi) represent the actual coordinates of the center point of the prediction box, and (b_wi, b_hi) represent the actual width and actual height of the prediction box.
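To make the layout of such a feature map concrete, the sketch below reshapes a 13×13×75 output into its 3 groups of 25 information points; the channel ordering (4 box terms, 1 confidence, 20 class probabilities) is an assumption for illustration.

```python
import torch

def split_prediction(feature_map, num_anchors=3, num_classes=20):
    """feature_map: tensor of shape (13, 13, 75) -> box terms, confidence, class probabilities."""
    h, w, _ = feature_map.shape
    fm = feature_map.view(h, w, num_anchors, 5 + num_classes)  # (13, 13, 3, 25)
    t_xywh = fm[..., 0:4]      # t_x, t_y, t_w, t_h for each of the 3 anchors
    confidence = fm[..., 4:5]  # probability that the prediction box contains a target
    class_probs = fm[..., 5:]  # per-class probabilities (20 PASCAL VOC categories)
    return t_xywh, confidence, class_probs
```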
Step three: train the model on the PASCAL VOC data set, as shown in fig. 3.
(1) The network randomly initializes its weights so that the initial values follow a Gaussian normal distribution.
(2) The input data is propagated forward through the network structure of step two of the invention, producing as output feature map 1, feature map 2 and feature map 3, and the prediction box information is obtained from the feature map information;
(3) Match the real boxes annotated in the data set with the anchor boxes obtained by clustering: compute the center point of each real box, select the anchor boxes located at that center point, take the anchor box with the largest IoU value with the real box as the target box, assign the coordinate values of the real box to the target box to obtain the target box coordinates, set the class value of the target box to 1 and its confidence value to 1, and set the parameter values of the remaining unmatched anchor boxes to 0.
(4) Compute the error loss between the output values of the network prediction boxes and the target values of the target boxes using the loss function.
The total loss function is:
loss=center_loss+size_loss+confidence_loss+cls_loss
center_loss=x_loss+y_loss
confidence_loss=obj_loss+noobj_loss
where N represents the total number of network prediction boxes; l_i^obj indicates whether there is a target in the i-th prediction box, with l_i^obj = 1 if a target is present and 0 otherwise; (x_i, y_i) denotes the true center position of the i-th annotation box where the target is located, and (x̂_i, ŷ_i) the center position of the i-th prediction box; (w_i, h_i) are the true width and height of the i-th annotation box where the target is located, and (ŵ_i, ĥ_i) the width and height of the i-th prediction box; α is used to adjust the proportion of the scale loss among all the losses; C_i denotes the true confidence of the i-th annotation box where the target is located, and Ĉ_i the confidence of the i-th prediction box; p_i denotes the class probability of the object in the i-th annotation box where the target is located, and p̂_i the class probability of the i-th prediction box object.
(5) Update the weights with the Adam optimization algorithm until the number of iterations exceeds the preset number of epochs, then end training. The main test index of the method is mAP (mean Average Precision): first the average precision AP (Average Precision) is computed within each category, then the APs of all categories are averaged to obtain the mAP.
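A compact sketch of computing this index is given below; it uses the 11-point interpolation commonly associated with PASCAL VOC and assumes the true-positive flags for each detection have already been determined with the IoU = 0.5 criterion, so it is an illustration rather than the exact evaluation code.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """11-point AP for one class from detection scores and true-positive flags."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):  # 11-point interpolation
        p = precision[recall >= r].max() if np.any(recall >= r) else 0.0
        ap += p / 11.0
    return ap

def mean_average_precision(per_class_results):
    """per_class_results: {class: (scores, tp_flags, num_gt)}; mAP is the mean of per-class APs."""
    aps = [average_precision(*res) for res in per_class_results.values()]
    return sum(aps) / len(aps)  # mAP: mean over all 20 classes
```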
Step four: and comparing the classical YOLOv3 algorithm, and analyzing the test result.
In the test process, the detection accuracy at IoU = 0.5 is adopted as the metric of algorithm performance: if the intersection-over-union between the predicted rectangular box of a picture and the real rectangular box of that picture is greater than 0.5, the algorithm is considered to have detected that picture successfully.
The invention is further described below in connection with simulation examples.
Simulation example:
The original YOLOv3 model is adopted as the comparison model, the PASCAL VOC data set is used as the training set and testing set, and partial detection result graphs are given.
Fig. 5 shows partial detection results of the original YOLOv3 model on pictures with different backgrounds, different categories and different sizes; it can be seen that the basic categories of the objects in the pictures are detected well.
FIG. 6 is a graph comparing the results of the original YOLOv3 and the model part of the present invention, wherein the left side is a graph showing the results of the original YOLOv3 model, and the right side is a graph showing the results of the model of the present invention.
Looking at the left column, it can be seen that the original YOLOv3 model does not perform very well when the target categories are similar, when targets overlap, or when targets are densely distributed in the picture: in fig. 6, three sheep are mistakenly detected as cattle; there are two bears but only one is detected; and the apples are densely distributed but only two are shown as detected.
Comparing with the left column, it can be seen that, for the three cases mentioned above, after the improvement of the method of the present invention, the three sheep are successfully detected in the case of similar target categories; the two bears are accurately detected in the case of overlapping targets; and in the case of densely distributed targets more accurate detections are given, ensuring the recall rate.
FIG. 7 illustrates the overall performance of the original YOLOv3 and the model of the invention on the validation data set, where AP is the average precision of each class and mAP is the mean average precision over all classes; it can be seen that on the validation set the mean average precision mAP of the model of the invention is higher than that of the original YOLOv3 model.
Comprehensive simulation experiments prove that the improved model achieves better detection precision than YOLOv3, with an improvement of nearly 1% in precision; in addition, the module introduces no extra computation, and real-time performance is unaffected compared with the original model. The module can be embedded into other classical algorithm models for comparison tests and has broad applicability.
The present invention is not limited to the above-mentioned embodiments; any simple modifications, equivalent changes and adaptations made to the above embodiments according to the technical substance of the present invention by those skilled in the art, without departing from the scope of the present invention, still belong to the protection scope of the present invention.
Claims (3)
1. A YOLOv3 algorithm based on an activation function improvement, comprising the steps of:
step one: downloading the PASCAL VOC data set, the general data set in the current target detection field, remaining consistent with the data sets commonly used in the field so that comparisons can be made, and testing the performance of the method;
step two: reconstructing the YOLOv3 network structure based on the improved activation function;
firstly, randomizing the initial weights of the network so that they follow a Gaussian normal distribution, and then inputting an RGB picture, which can be expressed as a matrix of size a×a×3, wherein a is the width and the height of the picture;
the input matrix then passes through the network structure constructed below, which consists of 52 convolutional layers and is divided into three stages, i.e., three outputs at different scales; in the following, feature map sizes and kernel counts are written as products using "×":
through the 1st convolution layer, with a 3×3 kernel, stride 2 and 32 kernels, a 208×208×32 feature map is obtained; the input then enters the 2nd convolution layer, with a 3×3 kernel, stride 1 and 32 kernels, giving a 208×208×32 feature map, and so on; following the different convolution kernels of each layer in the network diagram, the data passes through three different stages, yielding in turn a 52×52×256 feature map, a 26×26×512 feature map and a 13×13×1024 feature map, which then enter feature fusion layers 1, 2 and 3 for further feature fusion, respectively as follows:
the feature fusion layer 1 is a convolution module comprising 5 convolution operations; the kernel sizes and numbers are, in order, 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, all with stride 1, yielding a 52×52×128 feature map; 3×3×75 and 1×1×75 convolutions are then applied, finally giving a 52×52×75 feature map;
the feature fusion layer 2 is a convolution module comprising 5 convolution operations; the kernel sizes and numbers are, in order, 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, all with stride 1, yielding a 26×26×128 feature map; 3×3×75 and 1×1×75 convolutions are then applied, finally giving a 26×26×75 feature map;
the feature fusion layer 3 is a convolution module comprising 5 convolution operations; the kernel sizes and numbers are, in order, 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, all with stride 1, yielding a 13×13×128 feature map; 3×3×75 and 1×1×75 convolutions are then applied, finally giving a 13×13×75 feature map;
wherein each convolution layer comprises 3 operations:
first step: performing a convolution operation on the feature map matrix input to the layer;
second step: applying batch normalization to the convolution result of the previous step, normalizing all data to between 0 and 1 to obtain a normalized two-dimensional matrix, which helps accelerate training;
third step: taking the normalized two-dimensional matrix obtained in the previous step as the input of the activation function to obtain the final output of the layer;
the formula of the activation function is as follows:
y = x × tanh(ln(1 + e^x))
wherein x is the normalized two-dimensional matrix obtained in the previous step, tanh() is the hyperbolic tangent function, and y is the output value of each neuron after the activation function; introducing this nonlinear activation function into the network ensures that the mapping between input and output is nonlinear rather than a simple linear combination, which preserves the learning ability of the network;
the output of the feature extraction module is three feature matrices with dimensions 52×52×75, 26×26×75 and 13×13×75, wherein each neuron in the 52×52×75 feature matrix has the smallest receptive field and is responsible for detecting small targets in the original input image, and likewise each neuron in the 13×13×75 feature matrix has the largest receptive field and is responsible for detecting large targets in the original input image; prediction is thus carried out at multiple scales, which avoids missed detection of small targets;
taking the 13×13×75 feature map as an example, the first dimension 13 represents the number of horizontal grid cells of the picture, the second dimension 13 represents the number of vertical grid cells of the picture, and the third dimension 75 represents the number of features of the target of interest: it contains 3 groups of scale information, each group comprising 25 information points, namely the 4 coordinate values t_xi, t_yi, t_wi, t_hi of the prediction box, the prediction confidence, and the class probabilities; (t_xi, t_yi) are the coordinate parameters of the center point of the i-th prediction box, (t_wi, t_hi) are the parameters of the width and height of the i-th prediction box, the prediction confidence represents the probability that the i-th prediction box contains a target, and the class probability is a multidimensional vector representing the probability that the target in the i-th prediction box belongs to a certain class; note that t_xi, t_yi, t_wi, t_hi are relative position coordinates which need to be converted into the actual coordinates in the original picture; the conversion formulas are as follows:
b_xi = σ(t_xi) + c_x, b_yi = σ(t_yi) + c_y, b_wi = p_w·e^(t_wi), b_hi = p_h·e^(t_hi), wherein σ() is the sigmoid function, t_xi, t_yi, t_wi, t_hi are the predicted relative coordinate values, p_w, p_h represent the width and height of the anchor box corresponding to the prediction box, c_x, c_y represent the offset of the prediction box relative to the top-left corner position coordinates of the picture, (b_xi, b_yi) represent the actual coordinates of the center point of the prediction box, and (b_wi, b_hi) represent the actual width and actual height of the prediction box;
step three: training the model in a PASCAL VOC data set;
(1) randomly initializing the weights of the network so that the initial values follow a Gaussian normal distribution;
(2) propagating the input data forward through the network structure of step two of the invention, producing as output feature map 1, feature map 2 and feature map 3, and obtaining the prediction box information from the feature map information;
(3) matching the real boxes annotated in the data set with the anchor boxes obtained by clustering: computing the center point of each real box, selecting the anchor boxes located at that center point, taking the anchor box with the largest IoU value with the real box as the target box, assigning the coordinate values of the real box to the target box to obtain the target box coordinates, setting the class value of the target box to 1 and its confidence value to 1, and setting the parameter values of the remaining unmatched anchor boxes to 0;
(4) computing the error loss between the output values of the network prediction boxes and the target values of the target boxes using the loss function;
the total loss function is:
loss=center_loss+size_loss+confidence_loss+cls_loss
center_loss=x_loss+y_loss
confidence_loss=obj_loss+noobj_loss
where N represents the total number of network prediction boxes; l_i^obj indicates whether there is a target in the i-th prediction box, with l_i^obj = 1 if a target is present and 0 otherwise; (x_i, y_i) denotes the true center position of the i-th annotation box where the target is located, and (x̂_i, ŷ_i) the center position of the i-th prediction box; (w_i, h_i) are the true width and height of the i-th annotation box where the target is located, and (ŵ_i, ĥ_i) the width and height of the i-th prediction box; α is used to adjust the proportion of the scale loss among all the losses; C_i denotes the true confidence of the i-th annotation box where the target is located, and Ĉ_i the confidence of the i-th prediction box; p_i denotes the class probability of the object in the i-th annotation box where the target is located, and p̂_i the class probability of the i-th prediction box object;
(5) updating the weights with the Adam optimization algorithm until the number of iterations exceeds the preset number of epochs, then ending training; the main test index of the method is mAP (mean Average Precision), representing the mean average precision: first the average precision AP (Average Precision) is computed within each category, then the APs of all categories are averaged to obtain the mAP;
step four: and comparing the classical YOLOv3 algorithm, and analyzing the test result.
2. The YOLOv3 algorithm based on an activation function improvement according to claim 1, wherein in step one the general data set in the current target detection field, the PASCAL VOC data set, is downloaded, and the PASCAL VOC data set provides 20 object categories; each picture in the data set used in the invention is annotated with the category information p_i of the target and the center position coordinates (x, y), width w and height h of the target, visualized by rectangular boxes.
3. The YOLOv3 algorithm based on an activation function improvement according to claim 1, wherein step four comprises: comparing with the classical YOLOv3 algorithm and analyzing the test results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010880785.1A CN112364974B (en) | 2020-08-28 | 2020-08-28 | YOLOv3 algorithm based on activation function improvement |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010880785.1A CN112364974B (en) | 2020-08-28 | 2020-08-28 | YOLOv3 algorithm based on activation function improvement |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112364974A CN112364974A (en) | 2021-02-12 |
CN112364974B true CN112364974B (en) | 2024-02-09 |
Family
ID=74516708
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010880785.1A Active CN112364974B (en) | 2020-08-28 | 2020-08-28 | YOLOv3 algorithm based on activation function improvement |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112364974B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112949633B (en) * | 2021-03-05 | 2022-10-21 | 中国科学院光电技术研究所 | Improved YOLOv 3-based infrared target detection method |
CN113486764B (en) * | 2021-06-30 | 2022-05-03 | 中南大学 | Pothole detection method based on improved YOLOv3 |
CN115113637A (en) * | 2022-07-13 | 2022-09-27 | 中国科学院地质与地球物理研究所 | Unmanned geophysical inspection system and method based on 5G and artificial intelligence |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109685152A (en) * | 2018-12-29 | 2019-04-26 | 北京化工大学 | A kind of image object detection method based on DC-SPP-YOLO |
WO2019144575A1 (en) * | 2018-01-24 | 2019-08-01 | 中山大学 | Fast pedestrian detection method and device |
CN111062282A (en) * | 2019-12-05 | 2020-04-24 | 武汉科技大学 | Transformer substation pointer type instrument identification method based on improved YOLOV3 model |
CN111310861A (en) * | 2020-03-27 | 2020-06-19 | 西安电子科技大学 | License plate recognition and positioning method based on deep neural network |
CN111310773A (en) * | 2020-03-27 | 2020-06-19 | 西安电子科技大学 | Efficient license plate positioning method of convolutional neural network |
-
2020
- 2020-08-28 CN CN202010880785.1A patent/CN112364974B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019144575A1 (en) * | 2018-01-24 | 2019-08-01 | 中山大学 | Fast pedestrian detection method and device |
CN109685152A (en) * | 2018-12-29 | 2019-04-26 | 北京化工大学 | A kind of image object detection method based on DC-SPP-YOLO |
CN111062282A (en) * | 2019-12-05 | 2020-04-24 | 武汉科技大学 | Transformer substation pointer type instrument identification method based on improved YOLOV3 model |
CN111310861A (en) * | 2020-03-27 | 2020-06-19 | 西安电子科技大学 | License plate recognition and positioning method based on deep neural network |
CN111310773A (en) * | 2020-03-27 | 2020-06-19 | 西安电子科技大学 | Efficient license plate positioning method of convolutional neural network |
Non-Patent Citations (1)
Title |
---|
- 余永维; 韩鑫; 杜柳青. Part recognition based on the Inception-SSD algorithm. Optics and Precision Engineering, 2020, (08), full text. *
Also Published As
Publication number | Publication date |
---|---|
CN112364974A (en) | 2021-02-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977918B (en) | Target detection positioning optimization method based on unsupervised domain adaptation | |
CN110136154B (en) | Remote sensing image semantic segmentation method based on full convolution network and morphological processing | |
CN110334765B (en) | Remote sensing image classification method based on attention mechanism multi-scale deep learning | |
CN105678284B (en) | A kind of fixed bit human body behavior analysis method | |
CN105138973B (en) | The method and apparatus of face authentication | |
CN103605972B (en) | Non-restricted environment face verification method based on block depth neural network | |
CN111612017B (en) | Target detection method based on information enhancement | |
CN113486981B (en) | RGB image classification method based on multi-scale feature attention fusion network | |
CN112364974B (en) | YOLOv3 algorithm based on activation function improvement | |
CN114758288B (en) | Power distribution network engineering safety control detection method and device | |
CN112418212B (en) | YOLOv3 algorithm based on EIoU improvement | |
CN112464865A (en) | Facial expression recognition method based on pixel and geometric mixed features | |
CN111079683A (en) | Remote sensing image cloud and snow detection method based on convolutional neural network | |
CN110197205A (en) | A kind of image-recognizing method of multiple features source residual error network | |
CN110543906B (en) | Automatic skin recognition method based on Mask R-CNN model | |
CN110287873A (en) | Noncooperative target pose measuring method, system and terminal device based on deep neural network | |
CN107169504A (en) | A kind of hand-written character recognition method based on extension Non-linear Kernel residual error network | |
CN103714148B (en) | SAR image search method based on sparse coding classification | |
CN110716792B (en) | Target detector and construction method and application thereof | |
CN111898621A (en) | Outline shape recognition method | |
CN113469088A (en) | SAR image ship target detection method and system in passive interference scene | |
CN114998220A (en) | Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment | |
CN111598854A (en) | Complex texture small defect segmentation method based on rich robust convolution characteristic model | |
Lin et al. | Determination of the varieties of rice kernels based on machine vision and deep learning technology | |
CN114048468A (en) | Intrusion detection method, intrusion detection model training method, device and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||