CN111046787A - Pedestrian detection method based on improved YOLO v3 model

Pedestrian detection method based on improved YOLO v3 model

Info

Publication number
CN111046787A
CN111046787A (application CN201911257993.XA)
Authority
CN
China
Prior art keywords
model
target
yolo
grid
pedestrian
Prior art date
Legal status
Pending
Application number
CN201911257993.XA
Other languages
Chinese (zh)
Inventor
陈健
黄德天
Current Assignee
Huaqiao University
Original Assignee
Huaqiao University
Priority date
Filing date
Publication date
Application filed by Huaqiao University filed Critical Huaqiao University
Priority to CN201911257993.XA priority Critical patent/CN111046787A/en
Publication of CN111046787A publication Critical patent/CN111046787A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pedestrian detection method based on an improved YOLO v3 model, which comprises the following steps: selecting training samples; performing K-means clustering on the samples to obtain new anchor values, and replacing the dataset parameters in the original YOLO v3 model with the new anchor values; introducing an Inception module and performing clipping optimization on it to obtain an improved YOLO v3 model; and detecting pedestrians with the improved YOLO v3 model to obtain detection results. The method solves the problem that the features extracted by the original YOLO v3 model are too limited, and improves pedestrian detection accuracy.

Description

Pedestrian detection method based on improved YOLO v3 model
Technical Field
The invention relates to pedestrian detection methods based on neural networks, and in particular to a pedestrian detection method based on an improved YOLO v3 model.
Background
Pedestrian detection is a branch of the target detection field, and the urgent need for pedestrian detection technology is reflected in many areas, such as intelligent transportation, security video surveillance, and autonomous driving. In the early days, due to the limitations of computer hardware, pedestrian detection was mainly image-based and only needed to determine whether pedestrians existed in an image. Nowadays, with the rapid development of microelectronics and computer technology, the technology is required not only to detect pedestrians against simple backgrounds, but also to detect them accurately under strong interference from the external environment, such as strong light, weak light, and occlusion; meanwhile, detection is no longer limited to still images: real-time detection is required, and functions such as tracking and behavior recognition need to be added on top of detection. In addition, with the rapid development of deep learning in recent years, more and more deep learning models are being widely applied to computer vision technologies, ranging from ubiquitous license plate recognition and pedestrian detection to advanced driver assistance. Compared with traditional pedestrian detection methods, methods based on convolutional neural networks greatly improve detection accuracy and speed; however, the features extracted by the existing YOLO v3 model are too limited, so its recognition accuracy is not high.
Disclosure of Invention
The invention aims to provide a pedestrian detection method based on an improved YOLO v3 model, which solves the problem that the features extracted by the original YOLO v3 model are too limited and improves pedestrian detection accuracy.
In a first aspect, the present invention provides a pedestrian detection method based on an improved YOLO v3 model, including:
step 1, selecting training samples;
step 2, performing K-means clustering on the samples to obtain new anchor values, and replacing the dataset parameters in the original YOLO v3 model with the new anchor values;
step 3, introducing an Inception module and performing clipping optimization on it to obtain an improved YOLO v3 model;
and step 4, detecting pedestrians with the improved YOLO v3 model to obtain the detection result.
Further, the step 1 is specifically: pedestrian images are extracted from the public data sets pascal voc2007 and pascal voc2012, and training samples are selected with a training-set-to-test-set ratio of 2:1.
Further, the step 2 is specifically:
a nonlinear mapping θ is used to map the samples x_i (i = 1, 2, …, l) into a high-dimensional space G; that is, the mapped samples are θ(x_1), θ(x_2), …, θ(x_l);
K-means clustering is then performed in the high-dimensional space with the optimization function
J = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} ‖θ(x_i) − m_k‖² (1)
where the sample mean m_k is obtained from
m_k = (1/|C_k|) Σ_{x_i ∈ C_k} θ(x_i) (2)
In the kernel space, the kernel distance between two feature points is computed as
d²(x_i, x_j) = N(x_i, x_i) − 2N(x_i, x_j) + N(x_j, x_j) (3)
where N is the kernel function.
All sample subsets obtained by clustering are merged; the merged set contains K target categories, and the mean of each of the K categories is computed as
x̄_i = (1/n_i) Σ_{x ∈ C_i} x (4)
where n_i is the number of samples in the i-th category and x̄_i is the mean of the i-th category.
The distance between any two category means is then computed:
I = ‖x̄_i − x̄_j‖² (5)
If the distance between the means of two target categories is smaller than a preset threshold, the two categories are merged into one; the inter-mean distance is then recomputed with formula (5). Merging the sample subsets in this way yields the final clustering result;
finally, anchor values matching the model are computed from the final clustering result, and the new anchor values replace the dataset parameters in the original YOLO v3 model.
Further, the step 3 is specifically: an Inception module is introduced and then clipped and optimized. The clipped Inception module is mainly a combination of a 3 × 3 convolution layer and a 5 × 5 convolution layer; the 5 × 5 convolution layer is in turn replaced by two consecutive 3 × 3 convolution layers. The two output branches with different receptive fields are merged by a route layer in the YOLO v3 model into one output layer, which is passed to the next convolutional network for further feature extraction; the clipped Inception module is then inserted into the original YOLO v3 model to obtain the improved YOLO v3 model.
Further, the step 4 is specifically:
step 4a: partition the image to be detected; when an image is input to the model, its size is adaptively adjusted to a square, and the image is then partitioned with an N × N grid;
step 4b: when the center point of a target falls inside a grid cell, that cell is responsible for classifying and locating the target, as follows:
when the center point of a target falls into one of the N × N grid cells, that cell generates B prediction boxes to detect the target; that is, each cell has B bounding boxes generated from the anchor predictions, together with a confidence CS indicating whether the cell contains a target, which jointly reflects the probability that a target exists in the bounding box under the current model and the accuracy of the predicted target position:
CS = Pr(object) × IOU_pred^truth (6)
where Pr(object) indicates whether the center point of a target is contained in the grid cell (1 if so, 0 otherwise), and IOU_pred^truth is the intersection-over-union between the bounding box generated by the cell's prediction and the ground-truth bounding box of the target;
each grid cell generates B predicted bounding boxes to detect the target inside it. Each prediction box comprises 5 parameters [x, y, w, h, confidence], where [x, y] are the coordinates of the target center within the cell, [w, h] are the width and height of the predicted box, and confidence is the intersection-over-union between the predicted box and the ground-truth box of the target. Each cell also produces a predicted value C_i for whether it contains a target of a given class, expressed as
C_i = Pr(Class_i | Object) (7)
step 4c: each grid cell obtained in step 4b thus carries 5 parameters, represented by the vector y_i as follows:
y_i = [b_x, b_y, b_w, b_h, c] (8)
where (b_x, b_y) are the coordinates of the target center, (b_w, b_h) are the width and height of the bounding box generated by the network for the target prediction, and c is the total confidence score of the prediction box;
step 4d: after prediction over all N × N grid cells is completed, the parameters of all cells are sorted and summarized, and the detection result for the whole image is output.
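The intersection-over-union term in equation (6) can be made concrete with a short sketch. The following is a minimal illustration, assuming boxes are given as [x, y, w, h] (center coordinates plus width and height) in the same coordinate frame; the function and variable names are illustrative and not prescribed by the patent.

def iou(box_a, box_b):
    # Intersection-over-union of two boxes given as [x, y, w, h]
    # (center coordinates plus width and height); an assumed convention.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Intersection rectangle; zero if the boxes do not overlap.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0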
One or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:
the embodiment of the application provides an improved pedestrian detection method based on improved YOLOv 3. Firstly, a model training data set is manufactured, and a new anchors value is calculated through K value clustering to replace the original YOLOv3 data set parameter; then, improving the YOLOv3 model by blending different sizes of receptive fields, and cutting the size of the receptive fields of the improved model, namely introducing a tiny-interception module to further enrich the extracted characteristics and reduce the number of network layers; and finally, training by adopting the improved YOLOv3 model, and applying the trained model to a pedestrian detection scene to realize pedestrian detection. The method provided by the invention improves the accuracy and robustness of the pedestrian detection algorithm and obtains better detection effect in the aspects of subjective vision and objective evaluation indexes.
The foregoing description is only an overview of the technical solution of the present invention; the embodiments of the present invention are described below so that the technical means of the present invention can be understood more clearly and the above and other objects, features, and advantages of the present invention become more readily apparent.
Drawings
The invention will be further described with reference to the following examples with reference to the accompanying drawings.
FIG. 1 is a schematic block diagram of the system of the present invention;
FIG. 2 is a schematic diagram of the small network of the present invention that replaces the 5 × 5 convolution.
FIG. 3a is detection result diagram I of the original model.
FIG. 3b is detection result diagram I of the model of the present invention.
FIG. 4a is detection result diagram II of the original model.
FIG. 4b is detection result diagram II of the model of the present invention.
FIG. 5a is detection result diagram III of the original model.
FIG. 5b is detection result diagram III of the model of the present invention.
FIG. 6a is detection result diagram IV of the original model.
FIG. 6b is detection result diagram IV of the model of the present invention.
Detailed Description
As shown in fig. 1, the present invention provides a pedestrian detection method based on an improved YOLO v3 model, comprising:
step 1, selecting training samples;
step 2, performing K-means clustering on the samples to obtain new anchor values, and replacing the dataset parameters in the original YOLO v3 model with the new anchor values;
step 3, introducing an Inception module and performing clipping optimization on it to obtain an improved YOLO v3 model;
and step 4, detecting pedestrians with the improved YOLO v3 model to obtain the detection result.
The step 1 is specifically: pedestrian images are extracted from the public data sets pascal voc2007 and pascal voc2012, and training samples are selected with a training-set-to-test-set ratio of 2:1.
The step 2 is specifically as follows:
a nonlinear mapping θ is used to map the samples x_i (i = 1, 2, …, l) into a high-dimensional space G; that is, the mapped samples are θ(x_1), θ(x_2), …, θ(x_l);
K-means clustering is then performed in the high-dimensional space with the optimization function
J = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} ‖θ(x_i) − m_k‖² (1)
where the sample mean m_k is obtained from
m_k = (1/|C_k|) Σ_{x_i ∈ C_k} θ(x_i) (2)
In the kernel space, the kernel distance between two feature points is computed as
d²(x_i, x_j) = N(x_i, x_i) − 2N(x_i, x_j) + N(x_j, x_j) (3)
where N is the kernel function.
All sample subsets obtained by clustering are merged; the merged set contains K target categories, and the mean of each of the K categories is computed as
x̄_i = (1/n_i) Σ_{x ∈ C_i} x (4)
where n_i is the number of samples in the i-th category and x̄_i is the mean of the i-th category.
The distance between any two category means is then computed:
I = ‖x̄_i − x̄_j‖² (5)
If the distance between the means of two target categories is smaller than a preset threshold, the two categories are merged into one; the inter-mean distance is then recomputed with formula (5). Merging the sample subsets in this way yields the final clustering result;
finally, anchor values matching the model are computed from the final clustering result, and the new anchor values replace the dataset parameters in the original YOLO v3 model.
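As a rough sketch of the clustering in this step, the code below runs K-means with the kernel distance of formula (3) and then merges clusters whose means fall within a preset threshold, per formula (5). It is a minimal sketch under assumed choices: an RBF kernel stands in for the unspecified kernel function N, cluster means are compared in the input space, and all names are illustrative.

import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    # Assumed kernel function N(·, ·); the patent leaves the choice open.
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_kmeans_with_merge(X, k, merge_threshold, n_iter=20, seed=0):
    # Kernel K-means (formulas (1)-(3)) followed by merging of close
    # cluster means (formulas (4)-(5)). X: (n, d) samples; k: initial
    # cluster count. Returns a list of index arrays, one per cluster.
    rng = np.random.default_rng(seed)
    n = len(X)
    labels = rng.integers(0, k, size=n)
    # Precompute the kernel matrix N(x_i, x_j).
    K = np.array([[rbf_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(n_iter):
        # Squared kernel distance from each point to each cluster mean m_c:
        # ||θ(x) − m_c||² = N(x,x) − 2·mean_j N(x, x_j) + mean_{j,j'} N(x_j, x_j')
        dist = np.full((n, k), np.inf)
        for c in range(k):
            idx = np.where(labels == c)[0]
            if len(idx) == 0:
                continue
            dist[:, c] = (np.diag(K) - 2 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    # Merge clusters whose input-space means are closer than the threshold.
    clusters = [np.where(labels == c)[0] for c in range(k) if np.any(labels == c)]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.sum((X[clusters[i]].mean(axis=0)
                            - X[clusters[j]].mean(axis=0)) ** 2)
                if d < merge_threshold:  # formula (5) against the threshold
                    clusters[i] = np.concatenate([clusters[i], clusters[j]])
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters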
The step 3 is specifically: an Inception module is introduced and then clipped and optimized. The clipped Inception module is mainly a combination of a 3 × 3 convolution layer and a 5 × 5 convolution layer; the 5 × 5 convolution layer is in turn replaced by two consecutive 3 × 3 convolution layers. The outputs of the two different receptive fields (the standalone 3 × 3 convolution layer and the two stacked 3 × 3 convolution layers that replace the 5 × 5 layer) are merged by a route layer in the YOLO v3 model into one output layer, which is passed to the next convolutional network for further feature extraction; the clipped Inception module is then inserted into the original YOLO v3 model to obtain the improved YOLO v3 model.
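A plausible PyTorch rendering of this clipped module is sketched below: one branch is a single 3 × 3 convolution, the other replaces a 5 × 5 convolution with two stacked 3 × 3 convolutions, and the two branches are concatenated along the channel axis in the manner of the YOLO v3 route layer. The channel counts, batch normalization, and LeakyReLU activation are assumptions; the patent does not fix them.

import torch
import torch.nn as nn

class ClippedInception(nn.Module):
    # Two-branch block: a 3x3 conv alongside a 5x5-equivalent path built
    # from two stacked 3x3 convs; outputs are concatenated (the analogue
    # of the YOLO v3 route layer). Hyperparameters here are illustrative.
    def __init__(self, in_ch, branch_ch=32):
        super().__init__()
        # Branch 1: a single 3x3 convolution (one receptive field).
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(branch_ch),
            nn.LeakyReLU(0.1),
        )
        # Branch 2: two stacked 3x3 convolutions, replacing one 5x5
        # (same 5x5 receptive field at lower computational cost).
        self.branch5x5 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(branch_ch),
            nn.LeakyReLU(0.1),
            nn.Conv2d(branch_ch, branch_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(branch_ch),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        # Concatenate the two receptive fields into one output layer.
        return torch.cat([self.branch3x3(x), self.branch5x5(x)], dim=1)

# Usage: a 1x64x52x52 feature map in, 2 * branch_ch channels out.
feat = torch.randn(1, 64, 52, 52)
out = ClippedInception(64, branch_ch=32)(feat)
print(out.shape)  # torch.Size([1, 64, 52, 52])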
The step 4 is specifically as follows:
step 4a: partition the image to be detected; when an image is input to the model, its size is adaptively adjusted to a square, and the image is then partitioned with an N × N grid;
step 4b: when the center point of a target falls inside a grid cell, that cell is responsible for classifying and locating the target, as follows:
when the center point of a target falls into one of the N × N grid cells, that cell generates B prediction boxes to detect the target; that is, each cell has B bounding boxes generated from the anchor predictions, together with a confidence CS indicating whether the cell contains a target, which jointly reflects the probability that a target exists in the bounding box under the current model and the accuracy of the predicted target position:
CS = Pr(object) × IOU_pred^truth (6)
where Pr(object) indicates whether the center point of a target is contained in the grid cell (1 if so, 0 otherwise), and IOU_pred^truth is the intersection-over-union between the bounding box generated by the cell's prediction and the ground-truth bounding box of the target;
each grid cell generates B predicted bounding boxes to detect the target inside it. Each prediction box comprises 5 parameters [x, y, w, h, confidence], where [x, y] are the coordinates of the target center within the cell, [w, h] are the width and height of the predicted box, and confidence is the intersection-over-union between the predicted box and the ground-truth box of the target. Each cell also produces a predicted value C_i for whether it contains a target of a given class, expressed as
C_i = Pr(Class_i | Object) (7)
step 4c: each grid cell obtained in step 4b thus carries 5 parameters, represented by the vector y_i as follows:
y_i = [b_x, b_y, b_w, b_h, c] (8)
where (b_x, b_y) are the coordinates of the target center, (b_w, b_h) are the width and height of the bounding box generated by the network for the target prediction, and c is the total confidence score of the prediction box;
step 4d: after prediction over all N × N grid cells is completed, the parameters of all cells are sorted and summarized, and the detection result for the whole image is output.
One specific embodiment of the present invention:
in order to further explain the technical scheme of the invention, the invention is explained in detail by the specific embodiment.
Inputting: the image containing the pedestrian to be recognized is a sample image containing the pedestrian for learning.
1. Sample data preparation
Pedestrian images are extracted from the public data sets pascal voc2007 and pascal voc2012; 2094 images in total are extracted and split into a training set and a test set at a 2:1 ratio, recorded respectively as TR = {(x_i, y_i)}, i = 1, …, N_TR, and TE = {(x_i, y_i)}, i = 1, …, N_TE, where TR represents the training set, TE represents the test set, x represents an input sample image, y represents the label corresponding to the sample image, and N represents the number of samples in the data set.
2. K-means clustering on the samples to obtain new anchor values
According to the TR training set in step 1, the widths and heights of the pedestrian boxes are read as the data to be classified and the cluster center points are initialized; the coordinates of a cluster center describe the width and height of a rectangular box. The IOU value between each cluster center and each rectangular box described by the data to be classified is computed, with the distance 1 − IOU as the classification criterion. Finally, 9 groups of anchors are obtained, comprising the predicted center coordinates, the widths and heights of the anchor boxes, and the predicted target class.
IOU(A, B) = S(A ∩ B) / S(A ∪ B)
where S represents the area of a rectangular box.
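A minimal sketch of this anchor computation follows, assuming the pedestrian box widths and heights have been read from the TR annotations. Since only width and height matter, the boxes are compared as if sharing a common center; the function names and the example data are illustrative, not taken from the patent.

import numpy as np

def wh_iou(wh, centers):
    # IOU between co-centered boxes given only widths and heights.
    # wh: (n, 2) array, centers: (k, 2) array -> (n, k) IOU matrix.
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centers[None, :, 1])
    union = wh[:, None, 0] * wh[:, None, 1] + \
            centers[None, :, 0] * centers[None, :, 1] - inter
    return inter / union

def kmeans_anchors(wh, k=9, n_iter=100, seed=0):
    # K-means over box (w, h) pairs with distance d = 1 - IOU,
    # yielding k anchor sizes as in the embodiment above.
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(n_iter):
        assign = np.argmax(wh_iou(wh, centers), axis=1)  # min of 1 - IOU
        new_centers = np.array([
            wh[assign == c].mean(axis=0) if np.any(assign == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers.prod(axis=1))]  # sort by box area

# Usage with hypothetical (w, h) pairs; the real setting uses k=9:
boxes_wh = np.array([[32, 64], [30, 70], [64, 128], [60, 120], [120, 240],
                     [16, 40], [90, 180], [45, 90], [20, 55], [75, 150]])
print(kmeans_anchors(boxes_wh, k=3))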
3. Improving the original YOLO v3 network by introducing a tiny-Inception module
3.1 Introducing convolution kernels of multiple sizes
An Inception module is introduced into the original YOLO v3 framework, and the training set TR is used to construct the new model data.
(1) The input-output relationship is
X = f(x; W, b), Y = h_θ(X)
where X is the abstract or hierarchical feature extracted from the input sample, f(·) is the feature extraction function, x is the input image, W is the convolution kernel, b is the bias value, Y is the predicted value of the sample image, and θ is the softmax logistic regression parameter.
(2) The classifier output is
y(k) = P(y = k | X; T) = exp(T_k X) / Σ_{j=1}^{K} exp(T_j X)
where T is the classifier coefficient and P represents the probability that the sample prediction result is k.
The final output class is y = argmax_k { y(k) } (4)
(3) An objective function is constructed with the cross entropy over the training set:
J(W, b; θ) = −(1/N) Σ_i Σ_k 1{y_i = k} log P(y_i = k | x_i) + λ_1 R(W) + λ_2 R(θ)
where R(W) and R(θ) are regularization terms used to sparsify the parameters and prevent overfitting, and λ_1 and λ_2 are the sparsity coefficients.
(4) The optimization formula for updating the parameters is
W ← W − α · ∂J(W, b; θ)/∂W, b ← b − α · ∂J(W, b; θ)/∂b
where J(W, b; θ) is the objective function constructed with the cross entropy in (3) and α is the learning rate; the remaining parameters are consistent with those of the classifier.
3.2 Clipping the Inception module
In the convolutional layers, each 5 × 5 convolution kernel is replaced by two consecutive 3 × 3 convolution kernels. A 5 × 5 kernel is computationally expensive, requiring 25/9 times the computation of a 3 × 3 kernel; the computation saved by the improvement is 1 − (9 + 9)/25 = 7/25 = 28%. As shown in fig. 2, the improved Inception module combines two different receptive fields, and the two outputs are then merged by the YOLO v3 route layer to form the input data of the next layer.
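The figures quoted above can be checked directly: counting multiply-accumulate operations per output position for a single channel pair, a 5 × 5 kernel costs 25, a 3 × 3 kernel costs 9, and two stacked 3 × 3 kernels cost 18 (channel dimensions are ignored here, as in the text).

cost_5x5 = 5 * 5                    # multiply-accumulates per output element
cost_3x3 = 3 * 3
cost_two_3x3 = 2 * cost_3x3         # two stacked 3x3 convolutions
print(cost_5x5 / cost_3x3)          # ≈ 2.78, the "25/9 times" factor vs one 3x3
print(1 - cost_two_3x3 / cost_5x5)  # 0.28, the 28% saving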
4 pedestrian detection
4.1 image adaptive adjustment and partitioning
The input image of arbitrary size is first adaptively resized to 224 × 224 and then partitioned with an N × N grid to form the new input image. The resolution of the input image domain is 224 × 224, and the output labels satisfy y ∈ {0, 1, 2, …, 1396}.
4.2 pedestrian prediction
After the image to be recognized passes through the improved YOLO v3 network and prediction over the N × N grid cells is completed, the position of the target pedestrian is determined from the network output, and the coordinates and confidence score of the rectangular box around the target pedestrian are output; finally, the parameters of all grid cells are sorted and summarized, and the detection result for the whole image is output.
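The aggregation described above can be illustrated with a short sketch that decodes a hypothetical network output of shape (N, N, B, 5) — B boxes of [x, y, w, h, confidence] per grid cell, per equation (8) — keeps boxes whose confidence clears a threshold, and gathers them into one detection list. The tensor layout, the relative-coordinate convention, and the threshold are assumptions for illustration only.

import numpy as np

def decode_grid_predictions(pred, conf_threshold=0.5, img_size=224):
    # pred: (N, N, B, 5) array of [x, y, w, h, confidence] per cell,
    # with x, y relative to the cell and w, h relative to the image
    # (an assumed convention). Returns (x1, y1, x2, y2, conf) tuples.
    N = pred.shape[0]
    cell = img_size / N
    detections = []
    for row in range(N):
        for col in range(N):
            for b in range(pred.shape[2]):
                x, y, w, h, conf = pred[row, col, b]
                if conf < conf_threshold:
                    continue
                # Center point: cell offset plus within-cell coordinates.
                cx, cy = (col + x) * cell, (row + y) * cell
                bw, bh = w * img_size, h * img_size
                detections.append((cx - bw / 2, cy - bh / 2,
                                   cx + bw / 2, cy + bh / 2, conf))
    # Sort by confidence when summarizing all grid cells for output.
    return sorted(detections, key=lambda d: d[-1], reverse=True)

# Usage: a random 7x7 grid with 2 boxes per cell.
preds = np.random.rand(7, 7, 2, 5)
print(len(decode_grid_predictions(preds, conf_threshold=0.9)))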
5 simulation experiment
The effects of the present invention can be further illustrated by the following simulation experiments. To ensure objectivity, the images are drawn from the public data sets pascal voc2007 and pascal voc2012: 1396 images containing people are used as training samples, and the remaining 698 images are used as the test set. The experiment is compared with the original YOLO v3 model algorithm.
To quantitatively evaluate the performance advantage of the improved algorithm, training and testing are carried out on the same data set, and the test results are then analyzed.
Table 1: Comprehensive performance test results (reproduced as an image in the original publication)
From the data in Table 1, it can be seen that the improved algorithm achieves small improvements in accuracy, recall, and average IOU value over the original YOLO v3 model.
Table 2: Pedestrian AP value comparison (reproduced as an image in the original publication)
The pedestrian samples in the test set are tested, and the comparison of AP (average precision) values for the pedestrian category is shown in Table 2. The improved network architecture demonstrates better detection performance than the original network. In addition, when the improved model is used to detect pedestrian images, as shown in fig. 3a to 6b, the completeness and accuracy of the prediction boxes improve to a certain extent, and missed and false detections are reduced. The improved YOLO model uses anchor scales computed for the pedestrian data set (pascal voc2007 + pascal voc2012), and richer layer structures are obtained by extracting features at different scales. Test results show that precision and recall reach 79% and 74% respectively, a slight improvement over the original YOLO v3; meanwhile, the new model improves the pedestrian AP value by 1.72%.
Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims (5)

1. A pedestrian detection method based on an improved YOLO v3 model, characterized by comprising the following steps:
step 1, selecting training samples;
step 2, performing K-means clustering on the samples to obtain new anchor values, and replacing the dataset parameters in the original YOLO v3 model with the new anchor values;
step 3, introducing an Inception module and performing clipping optimization on it to obtain an improved YOLO v3 model;
and step 4, detecting pedestrians with the improved YOLO v3 model to obtain the detection result.
2. The pedestrian detection method based on the improved YOLO v3 model of claim 1, wherein the step 1 is specifically: pedestrian images are extracted from the public data sets pascal voc2007 and pascal voc2012, and training samples are selected with a training-set-to-test-set ratio of 2:1.
3. The pedestrian detection method based on the improved YOLO v3 model of claim 1, wherein the step 2 is specifically:
a nonlinear mapping θ is used to map the samples x_i (i = 1, 2, …, l) into a high-dimensional space G; that is, the mapped samples are θ(x_1), θ(x_2), …, θ(x_l);
K-means clustering is then performed in the high-dimensional space with the optimization function
J = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} ‖θ(x_i) − m_k‖² (1)
where the sample mean m_k is obtained from
m_k = (1/|C_k|) Σ_{x_i ∈ C_k} θ(x_i) (2)
In the kernel space, the kernel distance between two feature points is computed as
d²(x_i, x_j) = N(x_i, x_i) − 2N(x_i, x_j) + N(x_j, x_j) (3)
where N is the kernel function.
All sample subsets obtained by clustering are merged; the merged set contains K target categories, and the mean of each of the K categories is computed as
x̄_i = (1/n_i) Σ_{x ∈ C_i} x (4)
where n_i is the number of samples in the i-th category and x̄_i is the mean of the i-th category.
The distance between any two category means is then computed:
I = ‖x̄_i − x̄_j‖² (5)
If the distance between the means of two target categories is smaller than a preset threshold, the two categories are merged into one; the inter-mean distance is then recomputed with formula (5). Merging the sample subsets in this way yields the final clustering result;
finally, anchor values matching the model are computed from the final clustering result, and the new anchor values replace the dataset parameters in the original YOLO v3 model.
4. The pedestrian detection method based on the improved YOLO v3 model of claim 1, wherein the step 3 is specifically: an Inception module is introduced and then clipped and optimized; the clipped Inception module is mainly a combination of a 3 × 3 convolution layer and a 5 × 5 convolution layer, and the 5 × 5 convolution layer is in turn replaced by two consecutive 3 × 3 convolution layers; the two output branches with different receptive fields are merged by a route layer in the YOLO v3 model into one output layer, which is passed to the next convolutional network for further feature extraction; and the clipped Inception module is inserted into the original YOLO v3 model to obtain the improved YOLO v3 model.
5. The pedestrian detection method based on the improved YOLO v3 model of claim 1, wherein the step 4 is specifically:
step 4a: partition the image to be detected; when an image is input to the model, its size is adaptively adjusted to a square, and the image is then partitioned with an N × N grid;
step 4b: when the center point of a target falls inside a grid cell, that cell is responsible for classifying and locating the target, as follows:
when the center point of a target falls into one of the N × N grid cells, that cell generates B prediction boxes to detect the target; that is, each cell has B bounding boxes generated from the anchor predictions, together with a confidence CS indicating whether the cell contains a target, which jointly reflects the probability that a target exists in the bounding box under the current model and the accuracy of the predicted target position:
CS = Pr(object) × IOU_pred^truth (6)
where Pr(object) indicates whether the center point of a target is contained in the grid cell (1 if so, 0 otherwise), and IOU_pred^truth is the intersection-over-union between the bounding box generated by the cell's prediction and the ground-truth bounding box of the target;
each grid cell generates B predicted bounding boxes to detect the target inside it. Each prediction box comprises 5 parameters [x, y, w, h, confidence], where [x, y] are the coordinates of the target center within the cell, [w, h] are the width and height of the predicted box, and confidence is the intersection-over-union between the predicted box and the ground-truth box of the target. Each cell also produces a predicted value C_i for whether it contains a target of a given class, expressed as
C_i = Pr(Class_i | Object) (7)
step 4c: each grid cell obtained in step 4b thus carries 5 parameters, represented by the vector y_i as follows:
y_i = [b_x, b_y, b_w, b_h, c] (8)
where (b_x, b_y) are the coordinates of the target center, (b_w, b_h) are the width and height of the bounding box generated by the network for the target prediction, and c is the total confidence score of the prediction box;
step 4d: after prediction over all N × N grid cells is completed, the parameters of all cells are sorted and summarized, and the detection result for the whole image is output.
CN201911257993.XA 2019-12-10 2019-12-10 Pedestrian detection method based on improved YOLO v3 model Pending CN111046787A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911257993.XA CN111046787A (en) 2019-12-10 2019-12-10 Pedestrian detection method based on improved YOLO v3 model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911257993.XA CN111046787A (en) 2019-12-10 2019-12-10 Pedestrian detection method based on improved YOLO v3 model

Publications (1)

Publication Number Publication Date
CN111046787A true CN111046787A (en) 2020-04-21

Family

ID=70235386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911257993.XA Pending CN111046787A (en) 2019-12-10 2019-12-10 Pedestrian detection method based on improved YOLO v3 model

Country Status (1)

Country Link
CN (1) CN111046787A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814749A (en) * 2020-08-12 2020-10-23 Oppo广东移动通信有限公司 Human body feature point screening method and device, electronic equipment and storage medium
CN111950500A (en) * 2020-08-21 2020-11-17 成都睿芯行科技有限公司 Real-time pedestrian detection method based on improved YOLOv3-tiny in factory environment
CN112347938A (en) * 2020-11-09 2021-02-09 南京机电职业技术学院 People stream detection method based on improved YOLOv3
CN112418212A (en) * 2020-08-28 2021-02-26 西安电子科技大学 Improved YOLOv3 algorithm based on EIoU
CN112560682A (en) * 2020-12-16 2021-03-26 重庆守愚科技有限公司 Valve automatic detection method based on deep learning
CN112598056A (en) * 2020-12-21 2021-04-02 北京工业大学 Software identification method based on screen monitoring
CN112633299A (en) * 2020-12-30 2021-04-09 深圳市优必选科技股份有限公司 Target detection method, network, device, terminal equipment and storage medium
CN113011390A (en) * 2021-04-23 2021-06-22 电子科技大学 Road pedestrian small target detection method based on image partition
CN113158897A (en) * 2021-04-21 2021-07-23 新疆大学 Pedestrian detection system based on embedded YOLOv3 algorithm
CN113609895A (en) * 2021-06-22 2021-11-05 上海中安电子信息科技有限公司 Road traffic information acquisition method based on improved Yolov3
CN113935410A (en) * 2021-10-13 2022-01-14 甘肃同兴智能科技发展有限责任公司 Electric power customer portrait method based on cross-correlation density clustering

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934121A (en) * 2019-02-21 2019-06-25 江苏大学 A kind of orchard pedestrian detection method based on YOLOv3 algorithm
CN110059554A (en) * 2019-03-13 2019-07-26 重庆邮电大学 A kind of multiple branch circuit object detection method based on traffic scene
CN110276247A (en) * 2019-05-09 2019-09-24 南京航空航天大学 A kind of driving detection method based on YOLOv3-Tiny

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934121A (en) * 2019-02-21 2019-06-25 江苏大学 A kind of orchard pedestrian detection method based on YOLOv3 algorithm
CN110059554A (en) * 2019-03-13 2019-07-26 重庆邮电大学 A kind of multiple branch circuit object detection method based on traffic scene
CN110276247A (en) * 2019-05-09 2019-09-24 南京航空航天大学 A kind of driving detection method based on YOLOv3-Tiny

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GE Wen et al.: "Application of the improved YOLOv3 algorithm in pedestrian recognition", Computer Engineering and Applications (计算机工程与应用) *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814749A (en) * 2020-08-12 2020-10-23 Oppo广东移动通信有限公司 Human body feature point screening method and device, electronic equipment and storage medium
CN111950500A (en) * 2020-08-21 2020-11-17 成都睿芯行科技有限公司 Real-time pedestrian detection method based on improved YOLOv3-tiny in factory environment
CN112418212A (en) * 2020-08-28 2021-02-26 西安电子科技大学 Improved YOLOv3 algorithm based on EIoU
CN112418212B (en) * 2020-08-28 2024-02-09 西安电子科技大学 YOLOv3 algorithm based on EIoU improvement
CN112347938B (en) * 2020-11-09 2023-09-26 南京机电职业技术学院 People stream detection method based on improved YOLOv3
CN112347938A (en) * 2020-11-09 2021-02-09 南京机电职业技术学院 People stream detection method based on improved YOLOv3
CN112560682A (en) * 2020-12-16 2021-03-26 重庆守愚科技有限公司 Valve automatic detection method based on deep learning
CN112598056A (en) * 2020-12-21 2021-04-02 北京工业大学 Software identification method based on screen monitoring
CN112633299A (en) * 2020-12-30 2021-04-09 深圳市优必选科技股份有限公司 Target detection method, network, device, terminal equipment and storage medium
CN112633299B (en) * 2020-12-30 2024-01-16 深圳市优必选科技股份有限公司 Target detection method, network, device, terminal equipment and storage medium
CN113158897A (en) * 2021-04-21 2021-07-23 新疆大学 Pedestrian detection system based on embedded YOLOv3 algorithm
CN113011390A (en) * 2021-04-23 2021-06-22 电子科技大学 Road pedestrian small target detection method based on image partition
CN113609895A (en) * 2021-06-22 2021-11-05 上海中安电子信息科技有限公司 Road traffic information acquisition method based on improved Yolov3
CN113935410A (en) * 2021-10-13 2022-01-14 甘肃同兴智能科技发展有限责任公司 Electric power customer portrait method based on cross-correlation density clustering

Similar Documents

Publication Publication Date Title
CN111046787A (en) Pedestrian detection method based on improved YOLO v3 model
CN109978893B (en) Training method, device, equipment and storage medium of image semantic segmentation network
Feng et al. A review and comparative study on probabilistic object detection in autonomous driving
CN111062413B (en) Road target detection method and device, electronic equipment and storage medium
CN108416250B (en) People counting method and device
CN107563372B (en) License plate positioning method based on deep learning SSD frame
US20210089895A1 (en) Device and method for generating a counterfactual data sample for a neural network
CN103020978B (en) SAR (synthetic aperture radar) image change detection method combining multi-threshold segmentation with fuzzy clustering
CN110309747B (en) Support quick degree of depth pedestrian detection model of multiscale
CN110348437B (en) Target detection method based on weak supervised learning and occlusion perception
CN107784288B (en) Iterative positioning type face detection method based on deep neural network
CN110322445B (en) Semantic segmentation method based on maximum prediction and inter-label correlation loss function
CN106778687A (en) Method for viewing points detecting based on local evaluation and global optimization
CN112861970B (en) Fine-grained image classification method based on feature fusion
CN111368634B (en) Human head detection method, system and storage medium based on neural network
CN112287983B (en) Remote sensing image target extraction system and method based on deep learning
CN113609895A (en) Road traffic information acquisition method based on improved Yolov3
CN110738132A (en) target detection quality blind evaluation method with discriminant perception capability
CN110781970A (en) Method, device and equipment for generating classifier and storage medium
CN112801227A (en) Typhoon identification model generation method, device, equipment and storage medium
CN112149664A (en) Target detection method for optimizing classification and positioning tasks
CN114549909A (en) Pseudo label remote sensing image scene classification method based on self-adaptive threshold
CN112528058B (en) Fine-grained image classification method based on image attribute active learning
CN111797795A (en) Pedestrian detection algorithm based on YOLOv3 and SSR
CN116030300A (en) Progressive domain self-adaptive recognition method for zero-sample SAR target recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination