CN112364974B - YOLOv3 algorithm based on activation function improvement - Google Patents

YOLOv3 algorithm based on activation function improvement

Info

Publication number
CN112364974B
Authority
CN
China
Prior art keywords
frame
target
convolution
feature
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010880785.1A
Other languages
Chinese (zh)
Other versions
CN112364974A (en)
Inventor
王兰美
朱衍波
褚安亮
廖桂生
王桂宝
孙长征
贾建科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Shaanxi University of Technology
Original Assignee
Xidian University
Shaanxi University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University, Shaanxi University of Technology filed Critical Xidian University
Priority to CN202010880785.1A priority Critical patent/CN112364974B/en
Publication of CN112364974A publication Critical patent/CN112364974A/en
Application granted granted Critical
Publication of CN112364974B publication Critical patent/CN112364974B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an improved YOLOv3 algorithm based on an activation function that raises the average detection accuracy. The activation function introduces a nonlinear characteristic into the network, thereby preserving the learning ability of the network. Firstly, the PASCAL VOC data set, a data set in general use in the current target detection field, is prepared; secondly, the existing YOLOv3 network model is reconstructed, the Adam algorithm is adopted as the optimization algorithm in the training process, and the performance of the model is tested; then the improved activation function is embedded into the YOLOv3 algorithm model for training and performance evaluation; finally, a comparison with the classical YOLOv3 algorithm is made and the test results are analyzed. Compared with the classical YOLOv3 algorithm, the improved YOLOv3 algorithm based on the activation function provided by the invention raises the mAP index by nearly 1% on the general PASCAL VOC data set; in addition, the module does not introduce additional computation, so the real-time performance is unaffected compared with the original model. The module can be embedded into other classical algorithm models for comparison tests, and therefore has wide applicability and good robustness.

Description

YOLOv3 algorithm based on activation function improvement
Technical Field
The invention belongs to the field of image recognition, and particularly relates to a YOLOv3 target detection algorithm based on an improved activation function, wherein the algorithm shows good detection performance on a general standard data set PASCAL VOC.
Background
Target detection mainly comprises traditional target detection techniques and deep-learning-based target detection techniques. In recent years, with the development of technology and the spread of intelligent applications, traditional target detection techniques have long been unable to meet people's needs, so deep-learning-based target detection techniques emerged, developed rapidly, and have become the mainstream algorithms in the current target detection field.
Deep-learning-based target detection techniques can be broadly divided into two types of methods, two-stage and one-stage. The two-stage methods are mainly candidate-region-based algorithms such as R-CNN, Fast R-CNN and Faster R-CNN: these algorithms first generate a number of candidate regions on a picture, and then classify and regress candidate frames for these regions through a convolutional neural network. These methods have the highest precision, but their detection speed is low and cannot meet real-time requirements. The one-stage methods use a convolutional neural network to directly predict the classes and positions of different targets; they belong to end-to-end methods and mainly include the SSD and YOLO series.
The activation function (Activation Function) is a function running on the neurons of an artificial neural network and is responsible for mapping the inputs of a neuron to its output. Common activation functions are the Sigmoid function, the Tanh function and the ReLU function. The invention provides an improved activation function and embeds it into the existing classical YOLOv3 algorithm to test its performance: the improved model gains nearly 1% in precision over the original model, the module does not introduce additional computation, and the real-time performance is unaffected compared with the original model. The module can be embedded into other classical algorithm models for comparison tests, and therefore has wide applicability.
Disclosure of Invention
The method of the invention provides an improved YOLOv3 algorithm based on an activation function; the detection performance of the YOLOv3 algorithm is improved by embedding the improved activation function.
Step one: downloading a PASCAL VOC data set of a current target detection field general data set, ensuring to keep consistent with the field general data set so as to achieve a comparison effect, and detecting the performance of the method. The download address is: https:// pjreddie.com/projects/pascal-voc-dataset-mirror/.
The PASCAL VOC data set provides 20 object categories. Each picture in the data set used in the invention is annotated with the category information p_i of the target and the center position coordinates (x, y), width w and height h of the target, visualized by rectangular frames.
Step two: based on the improved activation function, the YOLOv3 network structure is reconstructed.
First, the initial weights of the network are randomized to follow a gaussian normal distribution, and then an RGB picture is input, which can be represented as a matrix form of a×a×3, where a is the width and height of the picture.
The input matrix then passes through the network structure built below, which consists of 52 convolutional layers and is divided into three stages, i.e., three outputs at different scales. In the following, the symbol "×" denotes the product of the dimensions:
Through the 1st convolution layer, with a 3×3 convolution kernel, a step length of 2 and 32 filters, a 208×208×32 feature map output is obtained; the input then enters the 2nd convolution layer, with a 3×3 convolution kernel, a step length of 1 and 32 filters, to obtain a 208×208×32 feature map output, and so on. Following the different convolution kernels of each layer in the network diagram, the data passes through three different stages to obtain a 52×52×256 feature map, a 26×26×512 feature map and a 13×13×1024 feature map in turn, and these then enter feature fusion layers 1, 2 and 3 respectively for the following feature fusion operations:
the feature fusion layer 1 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 52×52×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that the feature map of 52×52×75 is finally obtained.
The feature fusion layer 2 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 26×26×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that a feature map of 26×26×75 is finally obtained.
The feature fusion layer 3 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 13×13×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that a feature map of 13×13×75 is finally obtained.
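For concreteness, the sketch below lays out feature fusion layer 3 (the 13×13 branch) as a plain stack of convolutions, assuming PyTorch as the framework (the patent does not name one) and leaving out batch normalization and the activation function, both described below; the channel counts follow the description above, and 75 corresponds to 3 × (4 + 1 + 20) for the 20 PASCAL VOC classes.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of feature fusion layer 3 (13x13 branch): five convolutions with
# kernel size / filter count 1x1x128, 3x3x256, 1x1x128, 3x3x256, 1x1x128 (stride 1),
# followed by the 3x3x75 and 1x1x75 prediction convolutions. BN and activation omitted.
fusion_layer_3 = nn.Sequential(
    nn.Conv2d(1024, 128, kernel_size=1, stride=1),
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
    nn.Conv2d(256, 128, kernel_size=1, stride=1),
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
    nn.Conv2d(256, 128, kernel_size=1, stride=1),
    nn.Conv2d(128, 75, kernel_size=3, stride=1, padding=1),  # 3x3x75
    nn.Conv2d(75, 75, kernel_size=1, stride=1),              # 1x1x75
)

x = torch.randn(1, 1024, 13, 13)   # 13x13x1024 feature map from the backbone
print(fusion_layer_3(x).shape)     # torch.Size([1, 75, 13, 13]) -> 13x13x75
```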
Wherein each convolution layer contains 3 operations:
the first step: convolving the feature map matrix input to the layer
And a second step of: carrying out batch normalization processing on the convolution result obtained in the last step, normalizing all data to be between 0 and 1 to obtain a normalized two-dimensional matrix, and being beneficial to accelerating training speed;
and a third step of: and taking the normalized two-dimensional matrix obtained in the last step as the input of an activation function to obtain the final output y of the layer.
The formula for the activation function is as follows:
y = x × tanh(ln(1 + e^x))
where x is the normalized two-dimensional matrix obtained in the previous step, tanh() is the hyperbolic tangent function, and y is the value computed for each neuron after the activation function. Introducing this nonlinear activation function into the network ensures that the mapping between input and output is nonlinear rather than a simple linear combination, which preserves the learning capability of the network.
The output of the feature extraction module is three feature matrices, the dimensions of the three feature matrices are 52×52×75, 26×26×75 and 13×13×75 respectively, wherein the receptive field of each neuron in the feature matrix of 52×52×75 is minimum, and can be responsible for detecting small targets in the original input image, and similarly, the receptive field of each neuron in the feature matrix of 13×13×75 is maximum, and can be responsible for detecting large targets in the original input image. Thus, the multi-scale prediction is carried out, and the condition of missing detection of a small target can be avoided.
Taking the 13×13×75 feature map as an example, the first dimension 13 represents the number of horizontal pixels in the picture, the second dimension 13 represents the number of vertical pixels in the picture, and the third dimension 75 represents the number of features of the target of interest: it contains 3 scales of information, and each scale contains 25 information points. The 25 information points are the 4 coordinate parameters of the prediction frame t_xi, t_yi, t_wi, t_hi, the prediction confidence Ĉ_i, and the class probabilities p̂_i (a multidimensional vector with one entry per category, 20 for PASCAL VOC). Here (t_xi, t_yi) are the coordinate parameter values of the center point of the i-th prediction frame, (t_wi, t_hi) are the parameter values for the width and height of the i-th prediction frame, the prediction confidence Ĉ_i represents the probability that the i-th prediction frame contains a target, and the class probability p̂_i represents the probability that the target of the i-th prediction frame belongs to a certain category. Note that t_xi, t_yi, t_wi, t_hi are relative position coordinates, which need to be converted into the actual coordinates in the original picture. The conversion formulas are as follows:

b_x = σ(t_xi) + c_x
b_y = σ(t_yi) + c_y
b_w = p_w × e^(t_wi)
b_h = p_h × e^(t_hi)

where t_xi, t_yi, t_wi, t_hi are the predicted relative coordinate values, p_w and p_h are the width and height of the anchor frame corresponding to the prediction frame, c_x and c_y are the offsets of the prediction frame relative to the top-left corner coordinates of the picture, σ(·) is the sigmoid function, (b_x, b_y) are the actual coordinates of the center point of the prediction frame, and (b_w, b_h) are the actual width and height of the prediction frame.
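The conversion above matches the standard YOLOv3 box decoding; the sketch below illustrates it for a single prediction. The sigmoid on the center offsets and the function and variable names are assumptions consistent with the definitions above, not code taken from the patent.

```python
import math

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Convert relative predictions (t_x, t_y, t_w, t_h) into actual box parameters.

    c_x, c_y: offset of the cell containing the prediction from the top-left of the picture
    p_w, p_h: width and height of the anchor frame matched to this prediction
    Returns (b_x, b_y, b_w, b_h): actual center coordinates, width and height.
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    b_x = sigmoid(t_x) + c_x        # actual center x
    b_y = sigmoid(t_y) + c_y        # actual center y
    b_w = p_w * math.exp(t_w)       # actual width
    b_h = p_h * math.exp(t_h)       # actual height
    return b_x, b_y, b_w, b_h

# Example: a prediction in grid cell (6, 4) of the 13x13 map with a 3x5 anchor (grid units).
print(decode_box(0.2, -0.1, 0.3, 0.1, c_x=6, c_y=4, p_w=3.0, p_h=5.0))
```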
Step three: Training the improved YOLOv3 model in the PASCAL VOC data set;
(1) The network randomly initializes the weights, and the initialized values follow a Gaussian normal distribution.
(2) The input data is propagated forward through the network structure of step two of the invention to obtain the output values, namely feature map 1, feature map 2 and feature map 3, and the information of the prediction frames is obtained from the feature map information;
(3) Matching the real frames annotated in the data set with the anchor frames obtained by clustering: the center point of each real frame is calculated, the anchor frames corresponding to that center point are screened out, the anchor frame with the largest IoU with the real frame is selected as the target frame, the coordinate values of the real frame are assigned to the target frame to obtain its coordinate values, the class value and confidence value of the target frame are set to 1, and the parameter values of the remaining unmatched anchor frames are set to 0 (a sketch of the IoU computation is given after step four below).
(4) The error loss between the output values of the network prediction frames and the target values of the target frames is calculated using the loss function (an illustrative sketch of this computation is given after step (5) below).
The total loss function is:
loss=center_loss+size_loss+confidence_loss+cls_loss
center_loss=x_loss+y_loss
confidence_loss=obj_loss+noobj_loss
where N represents the total number of network prediction frames; l_i^obj indicates whether there is a target in the i-th prediction frame, with l_i^obj = 1 if a target is present and 0 otherwise; (x_i, y_i) is the true center position of the i-th annotation frame in which the target is located and (x̂_i, ŷ_i) is the center position of the i-th prediction frame; (w_i, h_i) are the true width and height of the i-th annotation frame and (ŵ_i, ĥ_i) are the width and height of the i-th prediction frame; α is used to adjust the proportion of the scale loss in the total loss; C_i is the true confidence of the i-th annotation frame and Ĉ_i is the confidence of the i-th prediction frame; p_i is the class probability of the object in the i-th annotation frame and p̂_i is the class probability of the object in the i-th prediction frame.
(5) The weights are updated using the Adam optimization algorithm until the number of iterations exceeds the set number of epochs, and training ends. The main test index of the method is mAP (mean Average Precision): the average precision AP (Average Precision) is first computed within each category, and the mAP is then obtained by averaging the APs over all categories.
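The per-term expressions of the loss are not reproduced in the text above, so the following is only an illustrative sketch under the assumption of squared-error terms, following the decomposition loss = center_loss + size_loss + confidence_loss + cls_loss and the symbol definitions given earlier; the dictionary-based data layout and function name are hypothetical.

```python
def total_loss(preds, targets, alpha=1.0):
    """Sketch of loss = center_loss + size_loss + confidence_loss + cls_loss.

    Squared-error terms are assumed here for illustration. Each element of `preds`
    and `targets` is a dict with keys x, y, w, h, conf, cls (a list of class
    probabilities); targets[i]["obj"] plays the role of l_i^obj (1 if a target is
    assigned to the i-th prediction frame, else 0), and alpha scales the size loss.
    """
    center_loss = size_loss = obj_loss = noobj_loss = cls_loss = 0.0
    for p, t in zip(preds, targets):
        if t["obj"]:  # l_i^obj = 1
            center_loss += (t["x"] - p["x"]) ** 2 + (t["y"] - p["y"]) ** 2
            size_loss += alpha * ((t["w"] - p["w"]) ** 2 + (t["h"] - p["h"]) ** 2)
            obj_loss += (t["conf"] - p["conf"]) ** 2
            cls_loss += sum((tc - pc) ** 2 for tc, pc in zip(t["cls"], p["cls"]))
        else:         # l_i^obj = 0: only the no-object confidence term contributes
            noobj_loss += (t["conf"] - p["conf"]) ** 2
    confidence_loss = obj_loss + noobj_loss
    return center_loss + size_loss + confidence_loss + cls_loss

# Tiny usage example with one matched prediction frame and two classes.
preds   = [{"x": 0.5, "y": 0.4, "w": 0.2,  "h": 0.3, "conf": 0.8, "cls": [0.9, 0.1]}]
targets = [{"x": 0.6, "y": 0.4, "w": 0.25, "h": 0.3, "conf": 1.0, "cls": [1.0, 0.0], "obj": 1}]
print(total_loss(preds, targets))
```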
Step four: Comparing with the classical YOLOv3 algorithm and analyzing the test results.
During testing, the detection accuracy at IoU = 0.5 is adopted as the measurement index of the algorithm's performance: if the intersection-over-union between the predicted rectangular frame for a picture and the real rectangular frame of that picture is greater than 0.5, the algorithm is considered to have detected the picture successfully.
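Both the anchor matching in step three and the detection-success criterion above rely on the intersection-over-union; a minimal sketch follows. The corner-based box form and helper names are assumptions for illustration, with a converter from the (center x, center y, width, height) representation used in the text.

```python
def to_corners(x, y, w, h):
    """Convert a (center x, center y, width, height) box, as used above, to corner form."""
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x_min, y_min, x_max, y_max) form."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Detection-success criterion used in testing: IoU between the predicted and real frames > 0.5.
pred  = to_corners(35, 35, 50, 50)
truth = to_corners(45, 40, 50, 50)
print(iou(pred, truth), iou(pred, truth) > 0.5)   # 0.5625 True -> counted as a success
```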
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will briefly explain the drawings needed in the embodiments or the prior art, and it is obvious that the drawings in the following description are only some embodiments of the present invention and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the structure of a YOLOv3 network model;
FIG. 3 is a schematic diagram of a network training process;
FIG. 4 is a diagram of an activation function;
FIG. 5 is a graph of partial detection results of the original YOLOv3 model;
FIG. 6 is a graph comparing the results of the partial detection of original YOLOv3 with the model of the present invention;
FIG. 7 shows the overall performance of the original YOLOv3 and the model of the invention on a validation data set.
Detailed Description
To make the above and other objects, features and advantages of the present invention more apparent, the following specific examples of the present invention are given together with the accompanying drawings, which are described in detail as follows:
referring to fig. 1, the implementation steps of the present invention are as follows:
step one: downloading a PASCAL VOC data set of a current target detection field general data set, ensuring to keep consistent with the field general data set so as to achieve a comparison effect, and detecting the performance of the method. The download address is: https:// pjreddie.com/projects/pascal-voc-dataset-mirror/.
The PASCAL VOC data set provides 20 object categories. Each picture in the data set used in the invention is annotated with the category information p_i of the target and the center position coordinates (x, y), width w and height h of the target, visualized by rectangular frames.
Step two: based on the improved activation function, the YOLOv3 network structure is reconstructed.
First, the initial weights of the network are randomized to follow a gaussian normal distribution, and then an RGB picture is input, which can be represented as a matrix form of a×a×3, where a is the width and height of the picture.
Then, as shown in fig. 2, the input matrix passes through the network structure constructed below, which consists of 52 convolution layers and is divided into three stages, i.e., three outputs at different scales. In the following, the symbol "×" denotes the product of the dimensions:
Through the 1st convolution layer, with a 3×3 convolution kernel, a step length of 2 and 32 filters, a 208×208×32 feature map output is obtained; the input then enters the 2nd convolution layer, with a 3×3 convolution kernel, a step length of 1 and 32 filters, to obtain a 208×208×32 feature map output, and so on. Following the different convolution kernels of each layer in the network diagram, the data passes through three different stages to obtain a 52×52×256 feature map, a 26×26×512 feature map and a 13×13×1024 feature map in turn, and these then enter feature fusion layers 1, 2 and 3 respectively for the following feature fusion operations:
the feature fusion layer 1 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 52×52×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that the feature map of 52×52×75 is finally obtained.
The feature fusion layer 2 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 26×26×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that a feature map of 26×26×75 is finally obtained.
The feature fusion layer 3 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 13×13×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that a feature map of 13×13×75 is finally obtained.
Wherein each convolution layer contains 3 operations:
the first step: convolving the feature map matrix input to the layer
And a second step of: carrying out batch normalization processing on the convolution result obtained in the last step, normalizing all data to be between 0 and 1 to obtain a normalized two-dimensional matrix, and being beneficial to accelerating training speed;
and a third step of: and taking the normalized two-dimensional matrix obtained in the last step as the input of an activation function to obtain the final output of the layer.
As shown in fig. 4, the formula for the activation function is as follows:
y = x × tanh(ln(1 + e^x))
where x is the normalized two-dimensional matrix obtained in the previous step, tanh() is the hyperbolic tangent function, and y is the value computed for each neuron after the activation function. Introducing this nonlinear activation function into the network ensures that the mapping between input and output is nonlinear rather than a simple linear combination, which preserves the learning capability of the network.
The output of the feature extraction module is three feature matrices, the dimensions of the three feature matrices are 52×52×75, 26×26×75 and 13×13×75 respectively, wherein the receptive field of each neuron in the feature matrix of 52×52×75 is minimum, and can be responsible for detecting small targets in the original input image, and similarly, the receptive field of each neuron in the feature matrix of 13×13×75 is maximum, and can be responsible for detecting large targets in the original input image. Thus, the multi-scale prediction is carried out, and the condition of missing detection of a small target can be avoided.
Taking the 13×13×75 feature map as an example, the first dimension 13 represents the number of horizontal pixels in the picture, the second dimension 13 represents the number of vertical pixels in the picture, and the third dimension 75 represents the number of features of the object of interest: it contains 3 scales of information, and each scale contains 25 information points. The 25 information points are the 4 coordinate parameters of the prediction frame t_xi, t_yi, t_wi, t_hi, the prediction confidence Ĉ_i, and the class probabilities p̂_i (a multidimensional vector with one entry per category, 20 for PASCAL VOC). Here (t_xi, t_yi) are the coordinate parameter values of the center point of the i-th prediction frame, (t_wi, t_hi) are the parameter values for the width and height of the i-th prediction frame, the prediction confidence Ĉ_i represents the probability that the i-th prediction frame contains a target, and the class probability p̂_i represents the probability that the target of the i-th prediction frame belongs to a certain category. Note that t_xi, t_yi, t_wi, t_hi are relative position coordinates, which need to be converted into the actual coordinates in the original picture. The conversion formulas are as follows:

b_x = σ(t_xi) + c_x
b_y = σ(t_yi) + c_y
b_w = p_w × e^(t_wi)
b_h = p_h × e^(t_hi)

where t_xi, t_yi, t_wi, t_hi are the predicted relative coordinate values, p_w and p_h are the width and height of the anchor frame corresponding to the prediction frame, c_x and c_y are the offsets of the prediction frame relative to the top-left corner coordinates of the picture, σ(·) is the sigmoid function, (b_x, b_y) are the actual coordinates of the center point of the prediction frame, and (b_w, b_h) are the actual width and height of the prediction frame.
Step three: Training the model in the PASCAL VOC data set, as shown in fig. 3;
(1) The network randomly initializes the weights, and the initialized values follow a Gaussian normal distribution.
(2) The input data is propagated forward through the network structure of step two of the invention to obtain the output values, namely feature map 1, feature map 2 and feature map 3, and the information of the prediction frames is obtained from the feature map information;
(3) Matching the real frames annotated in the data set with the anchor frames obtained by clustering: the center point of each real frame is calculated, the anchor frames corresponding to that center point are screened out, the anchor frame with the largest IoU with the real frame is selected as the target frame, the coordinate values of the real frame are assigned to the target frame to obtain its coordinate values, the class value and confidence value of the target frame are set to 1, and the parameter values of the remaining unmatched anchor frames are set to 0.
(4) The error loss between the output values of the network prediction frames and the target values of the target frames is calculated using the loss function.
The total loss function is:
loss=center_loss+size_loss+confidence_loss+cls_loss
center_loss=x_loss+y_loss
confidence_loss=obj_loss+noobj_loss
where N represents the total number of network prediction frames; l_i^obj indicates whether there is a target in the i-th prediction frame, with l_i^obj = 1 if a target is present and 0 otherwise; (x_i, y_i) is the true center position of the i-th annotation frame in which the target is located and (x̂_i, ŷ_i) is the center position of the i-th prediction frame; (w_i, h_i) are the true width and height of the i-th annotation frame and (ŵ_i, ĥ_i) are the width and height of the i-th prediction frame; α is used to adjust the proportion of the scale loss in the total loss; C_i is the true confidence of the i-th annotation frame and Ĉ_i is the confidence of the i-th prediction frame; p_i is the class probability of the object in the i-th annotation frame and p̂_i is the class probability of the object in the i-th prediction frame.
(5) The weights are updated using the Adam optimization algorithm until the number of iterations exceeds the set number of epochs, and training ends. The main test index of the method is mAP (mean Average Precision): the average precision AP (Average Precision) is first computed within each category, and the mAP is then obtained by averaging the APs over all categories.
Step four: Comparing with the classical YOLOv3 algorithm and analyzing the test results.
During testing, the detection accuracy at IoU = 0.5 is adopted as the measurement index of the algorithm's performance: if the intersection-over-union between the predicted rectangular frame for a picture and the real rectangular frame of that picture is greater than 0.5, the algorithm is considered to have detected the picture successfully.
The invention is further described below in connection with simulation examples.
Simulation example:
The invention adopts the original YOLOv3 model as the comparison model, adopts the PASCAL VOC data set as the training set and test set, and gives partial detection result graphs.
Fig. 5 shows partial detection results of the original YOLOv3 model; pictures with different backgrounds, categories and sizes are selected, and it can be seen that the detection of the basic categories of objects in the pictures is good.
FIG. 6 compares partial detection results of the original YOLOv3 and the model of the present invention, where the left side shows the results of the original YOLOv3 model and the right side shows the results of the model of the invention.
From the left column, it can be seen that the detection performance of the original YOLOv3 model is not very good when the target categories are similar, the targets overlap, or the targets are densely distributed in the picture: as in fig. 6, three sheep are mistakenly detected as cattle; there are two bears, but only one is detected; and the apples are densely distributed, but only two are shown as detected.
Compared with the left column, it can be seen that, for the three cases mentioned above, after the improvement of the method of the present invention, the three sheep are successfully detected in the case of similar target classes; in the case of overlapping targets, both bears are accurately detected; and in the case of densely distributed targets, more accurate detections are given and the recall rate is maintained.
FIG. 7 shows the overall performance of the original YOLOv3 and the model of the invention on the validation data set; AP is the average precision for each class and mAP is the mean average precision over all classes. It can be seen that, compared with the original YOLOv3 model, the model of the invention achieves a higher mAP on the validation set.
The comprehensive simulation experiments show that, compared with YOLOv3, the improved model achieves better detection precision, gaining nearly 1% in precision; in addition, the module does not introduce additional computation, and the real-time performance is unaffected compared with the original model. The module can be embedded into other classical algorithm models for comparison tests, and therefore has wide applicability.
The present invention is not limited to the above-mentioned embodiments; any simple modification, equivalent change or variation of the above embodiments made by those skilled in the art according to the technical substance of the present invention remains within the scope of the present invention.

Claims (3)

1. A YOLOv3 algorithm based on an activation function improvement, comprising the steps of:
step one: downloading a PASCAL VOC data set of a current target detection field general data set, ensuring that the PASCAL VOC data set is consistent with the field general data set so as to achieve a comparison effect, and detecting the performance of the method;
step two: reconstructing the YOLOv3 network structure based on the improved activation function;
firstly, randomizing initial weights of a network to enable the initial weights to follow Gaussian normal distribution, and then inputting an RGB picture which can be expressed as a matrix form of a multiplied by 3, wherein a is the width and the height of the picture;
the input matrix then passes through the network structure built below, which consists of 52 convolutional layers and is divided into three phases, i.e., three different scale outputs; in the following, the symbol "×" denotes the product of the dimensions:
through the 1st convolution layer, with a 3×3 convolution kernel, a step length of 2 and 32 filters, a 208×208×32 feature map output is obtained; entering the 2nd convolution layer, with a 3×3 convolution kernel, a step length of 1 and 32 filters, a 208×208×32 feature map output is obtained, and so on; according to the different convolution kernels of each layer in the network diagram, the data passes through three different stages to obtain a 52×52×256 feature map, a 26×26×512 feature map and a 13×13×1024 feature map in turn, and then enters feature fusion layers 1, 2 and 3 for the following feature fusion operations respectively:
the feature fusion layer 1 is a convolution module and comprises 5 steps of convolution operations, the sizes and the number of convolution kernels are 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128 in sequence, the step sizes are 1, a feature map of 52×52×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that a feature map of 52×52×75 is finally obtained;
the feature fusion layer 2 is a convolution module and comprises 5 steps of convolution operations, the sizes and the number of convolution kernels are 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128 in sequence, the step sizes are 1, a feature map of 26×26×128 is obtained, and then the convolution operations of 3×3×75 and 1×1×75 are carried out, so that a feature map of 26×26×75 is finally obtained;
the feature fusion layer 3 is a convolution module and comprises 5 steps of convolution operations, the sizes and the number of convolution kernels are 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128 in sequence, the step sizes are 1, a feature map of 13×13×128 is obtained, then the convolution operations of 3×3×75 and 1×1×75 are carried out, and finally a feature map of 13×13×75 is obtained;
wherein each convolution layer contains 3 operations:
the first step: performing convolution operation on the feature map matrix input into the layer;
and a second step of: carrying out batch normalization processing on the convolution result obtained in the last step, normalizing all data to be between 0 and 1 to obtain a normalized two-dimensional matrix, and being beneficial to accelerating training speed;
and a third step of: taking the normalized two-dimensional matrix obtained in the last step as the input of an activation function to obtain the final output of the layer;
the formula for the activation function is as follows:
y = x × tanh(ln(1 + e^x))
wherein x is the normalized two-dimensional matrix obtained in the previous step, tanh() is the hyperbolic tangent function, and y is the value computed for each neuron after the activation function; introducing this nonlinear activation function into the network ensures that the mapping between input and output is nonlinear rather than a simple linear combination, which preserves the learning capacity of the network;
the output of the feature extraction module is three feature matrices, the dimensions of the three feature matrices are 52×52×75, 26×26×75 and 13×13×75 respectively, wherein the receptive field of each neuron in the feature matrix of 52×52×75 is minimum, and can be responsible for detecting a small target in an original input image, and similarly, the receptive field of each neuron in the feature matrix of 13×13×75 is maximum, and can be responsible for detecting a large target in the original input image; thus, the multi-scale prediction is carried out, and the condition of missing detection of a small target can be avoided;
taking the 13×13×75 feature map as an example, the first dimension 13 represents the number of horizontal pixels in the picture, the second dimension 13 represents the number of vertical pixels in the picture, and the third dimension 75 represents the number of features of the object of interest: it contains 3 scales of information, and each scale contains 25 information points; the 25 information points are the 4 coordinate parameters of the prediction frame t_xi, t_yi, t_wi, t_hi, the prediction confidence Ĉ_i, and the class probabilities p̂_i (a multidimensional vector with one entry per category), where (t_xi, t_yi) are the coordinate parameter values of the center point of the i-th prediction frame, (t_wi, t_hi) are the parameter values for the width and height of the i-th prediction frame, the prediction confidence Ĉ_i represents the probability that the i-th prediction frame contains a target, and the class probability p̂_i represents the probability that the target of the i-th prediction frame belongs to a certain category; note that t_xi, t_yi, t_wi, t_hi are relative position coordinates, which need to be converted into the actual coordinates in the original picture; the conversion formulas are as follows:

b_x = σ(t_xi) + c_x
b_y = σ(t_yi) + c_y
b_w = p_w × e^(t_wi)
b_h = p_h × e^(t_hi)

wherein t_xi, t_yi, t_wi, t_hi are the predicted relative coordinate values, p_w and p_h are the width and height of the anchor frame corresponding to the prediction frame, c_x and c_y are the offsets of the prediction frame relative to the top-left corner coordinates of the picture, σ(·) is the sigmoid function, (b_x, b_y) are the actual coordinates of the center point of the prediction frame, and (b_w, b_h) are the actual width and height of the prediction frame;
step three: training the model in a PASCAL VOC data set;
(1) Randomly initializing a weight by a network to enable the initialized weight to be subjected to Gaussian normal distribution;
(2) The input data is transmitted forwards through the network structure in the second step of the invention to obtain output values which are characteristic diagram 1, characteristic diagram 2 and characteristic diagram 3, and the information of the predicted frame is obtained by utilizing the information of the characteristic diagram;
(3) Matching the real frames marked in the data set with anchor frames obtained by clustering: calculating a center point of a real frame, screening an anchor frame corresponding to the center point, selecting the anchor frame with the maximum IoU value with the real frame as a target frame, giving coordinate value information of the real frame to the target frame to obtain the coordinate value of the target frame, setting the class value of the target frame as 1, setting the confidence value as 1, and setting the parameter values of the rest untagged anchor frames as 0;
(4) Solving error loss between the output value of the network prediction frame and the target value of the target frame by using a loss function;
the total loss function is:
loss=center_loss+size_loss+confidence_loss+cls_loss
center_loss=x_loss+y_loss
confidence_loss=obj_loss+noobj_loss
where N represents the total number of network prediction frames; l_i^obj indicates whether there is a target in the i-th prediction frame, with l_i^obj = 1 if a target is present and 0 otherwise; (x_i, y_i) is the true center position of the i-th annotation frame in which the target is located and (x̂_i, ŷ_i) is the center position of the i-th prediction frame; (w_i, h_i) are the true width and height of the i-th annotation frame and (ŵ_i, ĥ_i) are the width and height of the i-th prediction frame; α is used to adjust the proportion of the scale loss in the total loss; C_i is the true confidence of the i-th annotation frame and Ĉ_i is the confidence of the i-th prediction frame; p_i is the class probability of the object in the i-th annotation frame and p̂_i is the class probability of the object in the i-th prediction frame;
(5) Updating the weight by using an Adam optimization algorithm until the iteration number is greater than epoch, and ending training; the main test index of the method is mAP (mean Average Precision), which represents average accuracy, firstly, the average accuracy is AP (Average Precision) in one category, and then the average accuracy of all categories is averaged mAP (mean Average Precision);
step four: and comparing the classical YOLOv3 algorithm, and analyzing the test result.
2. The YOLOv3 algorithm based on an activation function improvement according to claim 1, wherein in step one a general data set of the current target detection field, the PASCAL VOC data set, is downloaded; the PASCAL VOC data set provides 20 object categories; each picture in the data set used in the invention is annotated with the category information p_i of the target and the central position coordinates (x, y), width w and height h of the target, visualized by rectangular frames.
3. The YOLOv3 algorithm based on an activation function improvement according to claim 1, wherein in step four the classical YOLOv3 algorithm is used for comparison and the test results are analyzed.
CN202010880785.1A 2020-08-28 2020-08-28 YOLOv3 algorithm based on activation function improvement Active CN112364974B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010880785.1A CN112364974B (en) 2020-08-28 2020-08-28 YOLOv3 algorithm based on activation function improvement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010880785.1A CN112364974B (en) 2020-08-28 2020-08-28 YOLOv3 algorithm based on activation function improvement

Publications (2)

Publication Number Publication Date
CN112364974A CN112364974A (en) 2021-02-12
CN112364974B (en) 2024-02-09

Family

ID=74516708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010880785.1A Active CN112364974B (en) 2020-08-28 2020-08-28 YOLOv3 algorithm based on activation function improvement

Country Status (1)

Country Link
CN (1) CN112364974B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949633B (en) * 2021-03-05 2022-10-21 中国科学院光电技术研究所 Improved YOLOv 3-based infrared target detection method
CN113486764B (en) * 2021-06-30 2022-05-03 中南大学 Pothole detection method based on improved YOLOv3
CN115113637A (en) * 2022-07-13 2022-09-27 中国科学院地质与地球物理研究所 Unmanned geophysical inspection system and method based on 5G and artificial intelligence

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN111062282A (en) * 2019-12-05 2020-04-24 武汉科技大学 Transformer substation pointer type instrument identification method based on improved YOLOV3 model
CN111310861A (en) * 2020-03-27 2020-06-19 西安电子科技大学 License plate recognition and positioning method based on deep neural network
CN111310773A (en) * 2020-03-27 2020-06-19 西安电子科技大学 Efficient license plate positioning method of convolutional neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
CN111062282A (en) * 2019-12-05 2020-04-24 武汉科技大学 Transformer substation pointer type instrument identification method based on improved YOLOV3 model
CN111310861A (en) * 2020-03-27 2020-06-19 西安电子科技大学 License plate recognition and positioning method based on deep neural network
CN111310773A (en) * 2020-03-27 2020-06-19 西安电子科技大学 Efficient license plate positioning method of convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
余永维; 韩鑫; 杜柳青. 基于Inception-SSD算法的零件识别 (Part recognition based on the Inception-SSD algorithm). 光学精密工程 (Optics and Precision Engineering), 2020, No. 08, full text. *

Also Published As

Publication number Publication date
CN112364974A (en) 2021-02-12

Similar Documents

Publication Publication Date Title
CN109977918B (en) Target detection positioning optimization method based on unsupervised domain adaptation
CN110136154B (en) Remote sensing image semantic segmentation method based on full convolution network and morphological processing
CN110334765B (en) Remote sensing image classification method based on attention mechanism multi-scale deep learning
CN105678284B (en) A kind of fixed bit human body behavior analysis method
CN105138973B (en) The method and apparatus of face authentication
CN103605972B (en) Non-restricted environment face verification method based on block depth neural network
CN111612017B (en) Target detection method based on information enhancement
CN113486981B (en) RGB image classification method based on multi-scale feature attention fusion network
CN112364974B (en) YOLOv3 algorithm based on activation function improvement
CN114758288B (en) Power distribution network engineering safety control detection method and device
CN112418212B (en) YOLOv3 algorithm based on EIoU improvement
CN112464865A (en) Facial expression recognition method based on pixel and geometric mixed features
CN111079683A (en) Remote sensing image cloud and snow detection method based on convolutional neural network
CN110197205A (en) A kind of image-recognizing method of multiple features source residual error network
CN110543906B (en) Automatic skin recognition method based on Mask R-CNN model
CN110287873A (en) Noncooperative target pose measuring method, system and terminal device based on deep neural network
CN107169504A (en) A kind of hand-written character recognition method based on extension Non-linear Kernel residual error network
CN103714148B (en) SAR image search method based on sparse coding classification
CN110716792B (en) Target detector and construction method and application thereof
CN111898621A (en) Outline shape recognition method
CN113469088A (en) SAR image ship target detection method and system in passive interference scene
CN114998220A (en) Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment
CN111598854A (en) Complex texture small defect segmentation method based on rich robust convolution characteristic model
Lin et al. Determination of the varieties of rice kernels based on machine vision and deep learning technology
CN114048468A (en) Intrusion detection method, intrusion detection model training method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant