CN109978035B - Pedestrian detection method based on improved k-means and loss function

Info

Publication number
CN109978035B
CN109978035B
Authority
CN
China
Prior art keywords
data
xml
picture
loss
training
Prior art date
Legal status
Active
Application number
CN201910202078.4A
Other languages
Chinese (zh)
Other versions
CN109978035A (en)
Inventor
郭杰
郑佳卉
吴宪云
李云松
解静
邱尚锋
林朋雨
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority to CN201910202078.4A
Publication of CN109978035A
Application granted
Publication of CN109978035B
Legal status: Active

Classifications

    • G - Physics
    • G06 - Computing; Calculating or Counting
    • G06F - Electric Digital Data Processing
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/23 - Clustering techniques

Abstract

The invention provides a pedestrian detection method based on an improved k-means algorithm and an improved loss function, used for classifying and identifying videos or images containing pedestrian targets. It mainly addresses two problems in the prior art: inaccurate clustering results, and prediction frames that cannot learn the loss according to their own size characteristics. The method comprises the following steps: constructing a training set and a test set; clustering the training set with the improved k-means algorithm; improving the loss function of the YOLOv3 detection network; training on the training set with the improved loss function; and detecting the test set. In the clustering stage the invention screens invalid data out of the annotation information of the training set and clusters the remaining valid data, so that more accurate initial candidate-frame sizes are obtained and different prediction frames learn different prediction losses according to their own size characteristics, yielding a more accurate pedestrian target detection network.

Description

Pedestrian detection method based on improved k-means and loss function
Technical Field
The invention belongs to the technical field of target detection, relates to a pedestrian detection method, and particularly relates to a pedestrian detection method based on improved k-means and an improved loss function, which can be used for classifying and identifying videos or images containing pedestrian targets.
Background
Pedestrian detection refers to detecting the position coordinates and confidence of pedestrians in videos or images. The main indexes for measuring detection results are detection accuracy and detection speed, of which the most important is detection accuracy; it is often influenced by the pedestrian features and by the loss function.
At present, common pedestrian detection methods can be divided into two categories according to how the pedestrian features are extracted: pedestrian detection based on traditional algorithms and pedestrian detection based on deep learning.
Traditional pedestrian detection methods mainly comprise global-feature detection methods, detection methods based on local feature extraction, and multi-feature detection methods. Global-feature detection mainly detects the contour of a pedestrian through the histogram of oriented gradients of the whole image to find the pedestrian's position. Detection based on local feature extraction mainly extracts local features of the input picture and matches them against pedestrian features. Multi-feature detection mainly extracts and detects several kinds of features, such as gray scale and contour, and integrates their detection results. The common advantages of the three methods are simplicity and speed, but because pedestrian features are sensitive to factors such as illumination, background, and occlusion, background noise and light interference are easily introduced during detection, so the detection accuracy of traditional pedestrian detection methods is low.
The development of deep learning has brought new ideas to pedestrian detection research. Deep-learning-based pedestrian detection methods mainly comprise detection methods based on candidate-frame selection and end-to-end detection methods. Candidate-frame-based methods manually select candidate frames and then train the network; although they detect well, selecting candidate frames in advance makes the network's detection efficiency low.
In recent years, end-to-end detection methods have gradually become mainstream in the pedestrian detection field owing to their good detection accuracy and efficiency. Such a method takes a deep-learning-based target detection network as the base network and initializes the candidate-frame sizes with a clustering method, so that the initial candidate-frame sizes are close to the sizes of pedestrian features and the network converges more easily; it then trains on the training set with a loss function to obtain a pedestrian detection network model, and finally uses the model to detect the test set pictures, obtaining the position coordinates and confidence of all pedestrian targets. However, the base networks adopted by most current pedestrian detection algorithms, such as YOLOv1 and YOLOv2, still have unsatisfactory detection accuracy, so the detection accuracy of these pedestrian target detection algorithms is low. For example, the patent application with publication number CN 109325418A, entitled "pedestrian identification method in road traffic environment based on improved YOLOv3", discloses a pedestrian detection method using an improved YOLOv3. The method takes YOLOv3 as the base network, increases the number of candidate frames in the k-means clustering process to strengthen the network's feature extraction capability, and then, when training the network with the loss function, increases the weight of the coordinate loss term, obtaining a pedestrian detection network model. However, when clustering with k-means, the method does not consider that some annotation information in the training set may be invalid, so the clustering result is inaccurate; moreover, when calculating the loss, it does not consider that prediction frames of different sizes learn the coordinate errors and the width-height errors in different proportions, so the prediction frames cannot learn the loss according to their own size characteristics. Therefore, how to screen out the valid data in the training set's annotation information and compute a more accurate loss remains an urgent problem in this field.
Disclosure of Invention
The invention aims to overcome the defects of the existing pedestrian detection technology by providing a pedestrian detection method based on improved k-means and an improved loss function, so as to improve the detection accuracy of pedestrian targets in different scenes.
The technical idea of the invention is as follows: first, a training set and a test set are constructed; second, the improved k-means clustering algorithm is used to cluster the annotation information of the training set, and the clustering result is used as the size initialization value of the YOLOv3 network candidate frames; then the training set is trained in the YOLOv3 network based on the improved loss function; finally, the trained pedestrian detection network model is used to detect the test set.
According to the technical idea, the technical scheme adopted for achieving the purpose of the invention comprises the following steps:
(1) constructing a training set and a testing set:
(1a) storing N continuous or discontinuous frames of a pedestrian video from any scene into the JPEGImages folder as jpg pictures, and naming each picture, wherein N > 1000;
(1b) taking more than half of pictures in a JPEGImages folder as a training picture set, taking the rest pictures as a test picture set, writing the names of all the pictures in the training picture set into a train.txt file under an ImageSets/Main folder, and simultaneously writing the names of all the pictures in the test picture set into a test.txt file under the ImageSets/Main folder;
(1c) frame-labeling the different pedestrians contained in each picture of the training picture set and the test picture set, storing the coordinate data of the labeling frames, and then storing the class person of the pedestrian targets and the coordinate data of the labeling frames contained in each picture into an xml file, obtaining an Annotations folder consisting of a number of xml files, wherein each xml file has the same name as its corresponding pedestrian picture;
(1d) taking the xml files in the Annotations folder with the same names as the pictures in train.txt as the annotation information set of the training picture set, and the xml files with the same names as the pictures in test.txt as the annotation information set of the test picture set; writing the annotation information set of the training picture set into the train.txt file under the darknet folder and the annotation information set of the test picture set into the test.txt file under the darknet folder; the training picture set and its corresponding xml annotation information set form the training set, and the test picture set and its corresponding xml annotation information set form the test set;
(2) clustering the training set based on an improved k-means algorithm:
(2a) screening the annotation information in the training set:
(2a1) writing the coordinate data extracted from the xml annotation files corresponding to the training set into an array data_xml of length l, taking the first group of coordinate data read from data_xml as the current coordinate data, and initializing its current index value q in data_xml to 0;
(2a2) defining the coordinate data corresponding to q in data_xml: the x-axis projection coordinate of the upper left corner of the labeling frame is defined as x_min, the y-axis projection coordinate of the upper left corner as y_min, the x-axis projection coordinate of the lower right corner as x_max, and the y-axis projection coordinate of the lower right corner as y_max;
(2a3) calculating the difference x_d between x_max and x_min and the difference y_d between y_max and y_min, and judging whether the data corresponding to x_d and y_d in data_xml are valid: if x_d = 0 or y_d = 0, the data corresponding to x_d and y_d in data_xml are invalid; delete the invalid data, let l = l - 1, and execute step (2a2); if x_d ≠ 0 and y_d ≠ 0, the data corresponding to x_d and y_d in data_xml are valid; execute step (2a4);
(2a4) calculating the ratio div of x_d and y_d, and judging the validity of the data corresponding to div in data_xml according to whether div > 3 holds: if it holds, the data are invalid; delete the invalid data, let l = l - 1, and execute step (2a5); otherwise the data are valid; let q = q + 1 and execute step (2a5);
(2a5) repeating steps (2a2)-(2a4) until q = l, obtaining the valid annotation information;
(2b) clustering the valid annotation information:
(2b1) setting the number of clustering centers to k, k > 0; constructing a two-dimensional matrix data_k whose number of rows is the length l of data_xml and whose number of columns is k, the rows of data_k representing the valid annotation information stored in data_xml and the columns representing the values of the clustering centers; initializing data_k to 0;
(2b2) randomly initializing the k clustering centers;
(2b3) calculating the distance values between the l pieces of valid annotation information in data_xml and the k clustering centers, and writing each distance value into the position of data_k at the row corresponding to the valid annotation information and the column corresponding to the clustering center;
(2b4) assigning the valid annotation information corresponding to each row of data_k as a member of the clustering center of the column holding the minimum distance value in that row, and updating the value of each clustering center to the mean width and height of its members;
(2b5) repeating steps (2b3) and (2b4) until the values of the k clustering centers no longer change, and taking the values of the k clustering centers as the clustering result;
(3) improving the loss function of the YOLOv3 detection network:
modifying the coordinate loss function in the YOLOv3 detection network loss function into Loss'_coord:
Loss'_coord = λ_coord Σ_{i=0}^{l.w×l.h} Σ_{j=0}^{l.n} 1_{ij}^{obj} t_i [(x_i - x̂_i)² + (y_i - ŷ_i)² + (w_i - ŵ_i)² + (h_i - ĥ_i)²]
t_i = 2 - w_i × h_i
wherein λ_coord represents the network's weight parameter for the coordinates of the prediction box; l.w represents the division size of the network over the picture width, l.h the division size over the picture height, and l.n the number of prediction boxes in the network; i is the iteration variable over l.w × l.h and j the iteration variable over l.n; 1_{ij}^{obj} is the parameter indicating whether the prediction box contains a target; w_i denotes the width of the prediction box and ŵ_i the width of the labeling box; h_i denotes the height of the prediction box and ĥ_i the height of the labeling box; x_i denotes the x-axis projection of the upper-left-corner coordinate of the prediction box and x̂_i denotes x_min; y_i denotes the y-axis projection of the upper-left-corner coordinate of the prediction box and ŷ_i denotes y_min;
(4) Training the training set based on the improved loss function:
(4a) taking the clustering result as a size initialization value of a YOLOv3 network candidate box;
(4b) performing K times of iterative training on the training set based on an improved loss function in the YOLOv3 network, wherein K is more than 10000, and obtaining a pedestrian detection network model;
(5) detecting the test set:
inputting the test set to be detected into the pedestrian detection network model for detection, obtaining the position coordinates and confidence of each pedestrian target.
Compared with the prior art, the invention has the following advantages:
the method improves the loss function in the YOLOv3, increases the learning weight of the coordinate error in the coordinate loss function for the small-size prediction frame, and avoids the defect that the prediction frame cannot learn loss according to the size characteristics of the prediction frame, and simultaneously improves the k-means clustering algorithm, screens the values of the width-height size and the width-height ratio of the marking frame in the training set, removes invalid data while retaining the valid data, clusters the valid data, and avoids the defect that the detection precision is influenced by the inaccurate clustering result caused by the invalid marking information, and simulation results show that compared with the prior art, the method effectively improves the detection precision of pedestrian detection.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments.
Referring to fig. 1, the present invention includes the steps of:
step 1) constructing a training set and a testing set:
step 1a) extracting one picture every 10 frames from the continuous or discontinuous N frames of pedestrian video shot in any scene by a camera, an unmanned aerial vehicle, or a mobile phone, and storing the pictures into the JPEGImages folder, wherein N > 10000; this embodiment uses 12000 continuous frames from a video of pedestrians on a road shot by a mobile phone, at a resolution of 1920 × 1080, and names the pictures with distinct names; the number of pictures stored in the JPEGImages folder is not less than 1000;
step 1b) taking more than half of the pictures in the JPEGImages folder as the training picture set and the remaining pictures as the test picture set (this embodiment divides them in a 7:3 ratio; a short python sketch of this division is given after step 1d); writing the names of all pictures in the training picture set into the train.txt file under the ImageSets/Main folder and the names of all pictures in the test picture set into the test.txt file under the same folder, with each picture name occupying one line in train.txt and test.txt;
step 1c) performing frame labeling on the pedestrian targets contained in each picture of the training picture set and the test picture set:
step 1c1) labeling the class and the position coordinates (x_min, y_min, x_max, y_max) of each pedestrian target, wherein the class of every pedestrian target is person, x_min is the x-axis projection coordinate of the upper left corner of the labeling box, y_min the y-axis projection coordinate of the upper left corner, x_max the x-axis projection coordinate of the lower right corner, and y_max the y-axis projection coordinate of the lower right corner;
step 1c2) storing the annotation information of all pedestrian targets in each picture of the training picture set and the test picture set in xml format, obtaining an Annotations folder consisting of a number of xml-format files, wherein each xml file has the same name as the picture its annotation information belongs to; for example, the annotation file corresponding to picture 000001.jpg is named 000001.xml; the JPEGImages folder, the Annotations folder, and the ImageSets folder are placed in the darknet folder;
step 1d) taking the xml files in the Annotations folder with the same names as the pictures in train.txt as the annotation information set of the training picture set and the xml files with the same names as the pictures in test.txt as the annotation information set of the test picture set; writing the annotation information set of the training picture set into the train.txt file under the darknet folder and the annotation information set of the test picture set into the test.txt file under the darknet folder; the training picture set and its corresponding xml annotation information set form the training set, and the test picture set and its corresponding xml annotation information set form the test set;
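For concreteness, the following short python sketch (not part of the patent; the file names simply follow the VOC-style layout described above) shows one way to perform the division of step 1b and write the two name lists:

import os
import random

# collect picture names (without the .jpg suffix) from JPEGImages
names = [f[:-4] for f in os.listdir("JPEGImages") if f.endswith(".jpg")]
random.shuffle(names)
split = int(len(names) * 0.7)   # 7:3 division used in this embodiment

os.makedirs("ImageSets/Main", exist_ok=True)
with open("ImageSets/Main/train.txt", "w") as f:
    f.write("\n".join(names[:split]))
with open("ImageSets/Main/test.txt", "w") as f:
    f.write("\n".join(names[split:]))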
step 2) clustering the training set based on an improved k-means algorithm:
step 2a) screening the annotation information in the training set:
step 2a1) constructing an array data_xml; extracting the coordinate data from the xml files of the whole training set in python (e.g., by iterating over their object nodes) and writing them into data_xml in sequence, each member of data_xml representing one group of coordinate data; calculating the length l of data_xml with python's len function; reading the first group of coordinate data in data_xml and initializing the current index value q to 0;
step 2a2) defining the coordinate data corresponding to q in data_xml: the x-axis projection coordinate of the upper left corner of the labeling frame is defined as x_min, the y-axis projection coordinate of the upper left corner as y_min, the x-axis projection coordinate of the lower right corner as x_max, and the y-axis projection coordinate of the lower right corner as y_max;
step 2a3) calculating the difference x_d between x_max and x_min and the difference y_d between y_max and y_min, where x_min, x_max, y_min, and y_max are all floating point numbers, and judging whether the data corresponding to x_d and y_d in data_xml are valid: if x_d = 0 or y_d = 0, the data are invalid; delete the group of invalid data from data_xml with python's del function, let l = l - 1, and execute step (2a2); if x_d ≠ 0 and y_d ≠ 0, the data are valid; execute step (2a4);
step 2a4) calculating the ratio div of x_d and y_d and judging whether div > 3 holds: if it holds, the corresponding data in data_xml are invalid; delete the group of invalid data from data_xml with python's del function, let l = l - 1, and execute step (2a5); otherwise the data are valid; let q = q + 1 and execute step (2a5);
step 2a5) repeating steps (2a2)-(2a4) until q = l; the annotation information remaining in data_xml at this point is the valid annotation information, as illustrated by the sketch below;
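The following python sketch, given only as an illustration and not taken from the patent, implements steps 2a1-2a5: it builds data_xml from the VOC-style xml files and deletes the invalid labeling frames. The file path and the orientation of the ratio div (here taken as the larger of the two width-height ratios) are assumptions, since the patent states only that div > 3 marks a frame invalid:

import glob
import xml.etree.ElementTree as ET

# step 2a1: extract (xmin, ymin, xmax, ymax) from every annotation file
data_xml = []
for path in glob.glob("Annotations/*.xml"):
    for obj in ET.parse(path).findall("object"):
        b = obj.find("bndbox")
        data_xml.append([float(b.find(t).text)
                         for t in ("xmin", "ymin", "xmax", "ymax")])

# steps 2a2-2a5: screen out invalid labeling frames in place
q = 0
while q < len(data_xml):
    xmin, ymin, xmax, ymax = data_xml[q]
    xd, yd = xmax - xmin, ymax - ymin
    if xd == 0 or yd == 0:               # degenerate frame (step 2a3)
        del data_xml[q]                  # l shrinks by 1
        continue
    div = max(xd / yd, yd / xd)          # width-height ratio (step 2a4)
    if div > 3:                          # implausible pedestrian frame
        del data_xml[q]
        continue
    q += 1                               # valid frame: keep and advance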
step 2b) clustering the valid annotation information:
step 2b1) manually setting the number of clustering centers to k, k > 0 (k = 9 in this embodiment); constructing a two-dimensional matrix data_k whose number of rows is the current length l of data_xml and whose number of columns is k, the rows representing the valid annotation information stored in data_xml and the columns representing the values of the clustering centers; initializing data_k to 0 with np.zeros in python;
step 2b2) randomly initializing the k clustering centers in python, each clustering center being a floating point array of length 2, and writing the values of the clustering centers into an array named clusters;
step 2b3) calculating the distance values d(box, centroid) between the l pieces of valid annotation information in data_xml and the k clustering centers, with the calculation expression:
d(box, centroid) = 1 - IOU(box, centroid)
IOU(box, centroid) = (box ∩ centroid) / (box ∪ centroid)
box = x_d × y_d
wherein centroid represents the product of the two floating point members of a clustering center, box ∩ centroid represents the intersection of box and centroid, and box ∪ centroid represents the union of box and centroid; each d(box, centroid) is then written into the position of data_k at the row corresponding to the valid annotation information and the column corresponding to the clustering center;
step 2b4) using np.argmin in python to find the column holding the minimum distance value in each row of data_k, recording it in the variable nearest_clusters, and updating each clustering center in python with the following statement:
clusters[cluster] = dist(boxes[nearest_clusters == cluster], axis=0)
wherein cluster is the index of the clustering center, incremented in python until all clustering centers are updated, and dist denotes the width-height averaging described in this step (e.g., np.mean); the updated clustering centers are still stored in the array named clusters;
step 2b5) repeating steps (2b3) and (2b4) until the values of the k clustering centers no longer change, and taking the values of the k clustering centers as the clustering result; a compact python sketch of this clustering procedure follows;
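The sketch below illustrates steps 2b1-2b5 under stated assumptions; it is not the patent's code. boxes is assumed to be an (l, 2) float numpy array of the valid (x_d, y_d) width-height pairs, the 1 - IOU distance of step 2b3 treats boxes and centers as corner-aligned rectangles, and np.mean stands in for the averaging function written as dist above:

import numpy as np

def iou(boxes, clusters):
    # intersection over union of (w, h) pairs, corner-aligned
    w = np.minimum(boxes[:, None, 0], clusters[None, :, 0])
    h = np.minimum(boxes[:, None, 1], clusters[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (clusters[:, 0] * clusters[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, seed=0):
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]  # step 2b2
    while True:
        nearest = np.argmin(1 - iou(boxes, clusters), axis=1)   # step 2b3
        # step 2b4: each center becomes the mean (w, h) of its members
        # (an empty cluster would need re-seeding; omitted for brevity)
        new = np.array([boxes[nearest == c].mean(axis=0) for c in range(k)])
        if np.allclose(new, clusters):                          # step 2b5
            return clusters
        clusters = new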
step 3) improving the loss function of the YOLOv3 detection network:
modifying the coordinate loss function in the delta_region_box function of the region_layer.c file under the darknet/src folder into Loss'_coord:
Loss'_coord = λ_coord Σ_{i=0}^{l.w×l.h} Σ_{j=0}^{l.n} 1_{ij}^{obj} t_i [(x_i - x̂_i)² + (y_i - ŷ_i)² + (w_i - ŵ_i)² + (h_i - ĥ_i)²]
t_i = 2 - w_i × h_i
The complete modified loss function Loss' in YOLOv3 is:
Loss' = Loss_noobj + Loss_obj + Loss_class + Loss'_coord
Loss_noobj = λ_noobj Σ_{i=0}^{l.w×l.h} Σ_{j=0}^{l.n} 1_{ij}^{noobj} (c_i - ĉ_i)²
Loss_obj = λ_obj Σ_{i=0}^{l.w×l.h} Σ_{j=0}^{l.n} 1_{ij}^{obj} (c_i - ĉ_i)²
Loss_class = λ_class Σ_{i=0}^{l.w×l.h} 1_{i}^{obj} Σ_{c∈classes} (p_i(c) - p̂_i(c))²
wherein Loss_noobj represents the confidence loss function of prediction boxes not containing a target; Loss_obj represents the confidence loss function of prediction boxes containing a target; Loss_class represents the class loss function; Loss'_coord represents the improved coordinate loss function; λ_coord represents the network's weight parameter for the coordinates of the prediction box; l.w represents the division size of the network over the picture width, l.h the division size over the picture height, and l.n the number of prediction boxes in the network; i is the iteration variable over l.w × l.h and j the iteration variable over l.n; w_i denotes the width of the prediction box and ŵ_i the width of the labeling box; h_i denotes the height of the prediction box and ĥ_i the height of the labeling box; x_i denotes the x-axis projection of the upper-left-corner coordinate of the prediction box and x̂_i denotes x_min; y_i denotes the y-axis projection of the upper-left-corner coordinate of the prediction box and ŷ_i denotes y_min; λ_noobj represents the coefficient corresponding to prediction boxes not containing a target and 1_{ij}^{noobj} is the parameter indicating whether the prediction box does not contain a target; c_i is the confidence of the prediction box and ĉ_i the confidence of the labeling box; λ_obj represents the coefficient corresponding to prediction boxes containing a target and 1_{ij}^{obj} is the parameter indicating whether the prediction box contains a target; λ_class represents the coefficient corresponding to prediction boxes containing a target class, c is the iteration variable over classes, classes denotes the set of classes in the dataset, p_i(c) is the probability that the prediction box contains class c, and p̂_i(c) is the probability that the labeling box contains class c;
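To make the effect of the scale factor t_i concrete, here is a minimal numpy sketch of Loss'_coord as reconstructed above; it assumes pred and truth are arrays of normalized (x, y, w, h) rows for the prediction boxes responsible for a target (1_ij^obj = 1), and it is not the darknet C implementation itself:

import numpy as np

def improved_coord_loss(pred, truth, lambda_coord=1.0):
    # t_i = 2 - w_i * h_i: small boxes (small w*h) get a weight near 2,
    # large boxes a weight near 1, per the patent's description
    t = 2.0 - pred[:, 2] * pred[:, 3]
    sq = np.sum((pred - truth) ** 2, axis=1)  # (x, y, w, h) squared errors
    return lambda_coord * np.sum(t * sq)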
step 4) training the training set based on the improved loss function:
step 4a) initializing the training parameters of the pedestrian detection network:
modifying the paths of the training set and the test set in the voc.data file, setting the maximum iteration number max_batches to 50200, the picture batch size to 64, the initial learning rate to 10⁻³, and the momentum to 0.9;
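For reference, a darknet voc.data file of the kind edited in step 4a usually has the following form; the paths are placeholders for this embodiment's layout rather than values disclosed in the patent:

classes = 1
train   = /path/to/darknet/train.txt
valid   = /path/to/darknet/test.txt
names   = data/voc.names
backup  = backup/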
step 4b) taking the clustering result as the size initialization value of the YOLOv3 network candidate boxes:
writing the clustering result into the anchors entries of the yolov3-voc.cfg file;
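For illustration, the anchors entry in yolov3-voc.cfg has the following form; the nine width,height pairs shown are the stock YOLOv3 values and only indicate the format, as the patent's own clustering result would replace them:

anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326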
step 4c) performing K iterations of training on the training set based on the improved loss function in the YOLOv3 network, wherein K > 10000 (K = 20000 in this embodiment), obtaining the pedestrian detection network model;
step 5) detecting the test set:
step 5a) entering the following shell command under the darknet folder:
./darknet detector test cfg/voc.data cfg/yolov3-voc.cfg yolov3-voc_20000.weights
step 5b) according to the input shell command, the pedestrian detection network model, trained with the improved loss function, performs forward calculation on the read-in test set pictures, obtaining the position coordinates and confidence of each pedestrian target, which are stored in the data/out folder.
The technical effects of the invention are further explained below in combination with simulation experiments:
1. Simulation conditions and contents:
The simulation experiments of the invention were implemented in a configuration environment of an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, GeForce GTX 1080 Ti ×4, and 32GB of memory. The pedestrian video data used in the experiments come from pedestrians on roads in and near the campus of Xidian University, actually shot with a Redmi Note 7 mobile phone.
Simulation experiment: the invention was compared with the prior art by simulation. After constructing the training set and the test set according to the invention, the improved k-means was first used to screen the valid data out of the training set's annotation information; the valid data and the full annotation data were then clustered separately to obtain two clustering results, which were used as the initial candidate-frame sizes of the YOLOv3 network based on the improved loss function and of the prior-art network, respectively. The training set was then trained 20000 times with the improved loss function in YOLOv3 and, in parallel, 20000 times with the prior-art network, yielding two pedestrian detection network models. Finally, the test set was input into both models to obtain the position coordinates and confidence of each pedestrian target detected by each model, and the detection accuracy of the two methods was counted; the comparison is shown in the table below.
2. Simulation result analysis:
Compared with the prior art, the pedestrian detection result obtained by the invention has obvious advantages; the detection accuracy of the prior art and of the invention is shown in Table 1:

TABLE 1 Detection accuracy comparison

Evaluation index      Prior art    The invention
Detection accuracy    87.3         89.0
As is apparent from the table, the detection accuracy obtained by the invention is higher, and the invention's detection effect on pedestrian targets is better than that of the prior art.
The above description is only a specific example of the invention and should not be construed as limiting it in any way. It will be apparent to persons skilled in the relevant art that various modifications and changes in form and detail can be made without departing from the principle and structure of the invention, but such modifications and changes based on the inventive idea remain within the protection scope of the claims.

Claims (2)

1. A pedestrian detection method based on improved k-means and a loss function, comprising the steps of:
(1) constructing a training set and a testing set:
(1a) storing N continuous or discontinuous frames of a pedestrian video from any scene into the JPEGImages folder as jpg pictures, and naming each picture, wherein N > 10000;
(1b) taking more than half of pictures in a JPEGImages folder as a training picture set, taking the rest pictures as a test picture set, writing the names of all the pictures in the training picture set into a train.txt file under an ImageSets/Main folder, and simultaneously writing the names of all the pictures in the test picture set into a test.txt file under the ImageSets/Main folder;
(1c) frame-labeling the different pedestrians contained in each picture of the training picture set and the test picture set, storing the coordinate data of the labeling frames, and then storing the class person of the pedestrian targets and the coordinate data of the labeling frames contained in each picture into an xml file, obtaining an Annotations folder consisting of a number of xml files, wherein each xml file has the same name as its corresponding pedestrian picture;
(1d) taking the xml files in the Annotations folder with the same names as the pictures in train.txt as the annotation information set of the training picture set, and the xml files with the same names as the pictures in test.txt as the annotation information set of the test picture set; writing the annotation information set of the training picture set into the train.txt file under the darknet folder and the annotation information set of the test picture set into the test.txt file under the darknet folder; the training picture set and its corresponding xml annotation information set form the training set, and the test picture set and its corresponding xml annotation information set form the test set;
(2) clustering the training set based on an improved k-means algorithm:
(2a) screening the annotation information in the training set:
(2a1) writing the coordinate data extracted from the xml annotation files corresponding to the training set into an array data_xml of length l, taking the first group of coordinate data read from data_xml as the current coordinate data, and initializing its current index value q in data_xml to 0;
(2a2) defining the coordinate data corresponding to q in data_xml: the x-axis projection coordinate of the upper left corner of the labeling frame is defined as x_min, the y-axis projection coordinate of the upper left corner as y_min, the x-axis projection coordinate of the lower right corner as x_max, and the y-axis projection coordinate of the lower right corner as y_max;
(2a3) calculating the difference x_d between x_max and x_min and the difference y_d between y_max and y_min, and judging whether the data corresponding to x_d and y_d in data_xml are valid: if x_d = 0 or y_d = 0, the data corresponding to x_d and y_d in data_xml are invalid; delete the invalid data, let l = l - 1, and execute step (2a2); if x_d ≠ 0 and y_d ≠ 0, the data corresponding to x_d and y_d in data_xml are valid; execute step (2a4);
(2a4) calculating the ratio div of x_d and y_d, and judging the validity of the data corresponding to div in data_xml according to whether div > 3 holds: if it holds, the data are invalid; delete the invalid data, let l = l - 1, and execute step (2a5); otherwise the data are valid; let q = q + 1 and execute step (2a5);
(2a5) repeating steps (2a2)-(2a4) until q = l, obtaining the valid annotation information;
(2b) clustering the valid annotation information:
(2b1) setting the number of clustering centers to k, k > 0; constructing a two-dimensional matrix data_k whose number of rows is the length l of data_xml and whose number of columns is k, the rows of data_k representing the valid annotation information stored in data_xml and the columns representing the values of the clustering centers; initializing data_k to 0;
(2b2) randomly initializing the k clustering centers;
(2b3) calculating the distance values between the l pieces of valid annotation information in data_xml and the k clustering centers, and writing each distance value into the position of data_k at the row corresponding to the valid annotation information and the column corresponding to the clustering center;
(2b4) assigning the valid annotation information corresponding to each row of data_k as a member of the clustering center of the column holding the minimum distance value in that row, and updating the value of each clustering center to the mean width and height of its members;
(2b5) repeating steps (2b3) and (2b4) until the values of the k clustering centers no longer change, and taking the values of the k clustering centers as the clustering result;
(3) improving the loss function of the YOLOv3 detection network:
modifying the coordinate loss function in the YOLOv3 detection network loss function into Loss'_coord:
Loss'_coord = λ_coord Σ_{i=0}^{l.w×l.h} Σ_{j=0}^{l.n} 1_{ij}^{obj} t_i [(x_i - x̂_i)² + (y_i - ŷ_i)² + (w_i - ŵ_i)² + (h_i - ĥ_i)²]
t_i = 2 - w_i × h_i
wherein λ_coord represents the network's weight parameter for the coordinates of the prediction box; l.w represents the division size of the network over the picture width, l.h the division size over the picture height, and l.n the number of prediction boxes in the network; i is the iteration variable over l.w × l.h and j the iteration variable over l.n; 1_{ij}^{obj} is the parameter indicating whether the prediction box contains a target; w_i denotes the width of the prediction box and ŵ_i the width of the labeling box; h_i denotes the height of the prediction box and ĥ_i the height of the labeling box; x_i denotes the x-axis projection of the upper-left-corner coordinate of the prediction box and x̂_i denotes x_min; y_i denotes the y-axis projection of the upper-left-corner coordinate of the prediction box and ŷ_i denotes y_min;
(4) Training the training set based on the improved loss function:
(4a) taking the clustering result as a size initialization value of a YOLOv3 network candidate box;
(4b) performing K times of iterative training on the training set based on an improved loss function in the YOLOv3 network, wherein K is more than 10000, and obtaining a pedestrian detection network model;
(5) detecting the test set:
inputting the test set to be detected into the pedestrian detection network model for detection, obtaining the position coordinates and confidence of each pedestrian target.
2. The pedestrian detection method based on improved k-means and a loss function according to claim 1, wherein the loss function of the YOLOv3 detection network in step (3) has the calculation expression:
Loss = Loss_noobj + Loss_obj + Loss_class + Loss_coord
Loss_noobj = λ_noobj Σ_{i=0}^{l.w×l.h} Σ_{j=0}^{l.n} 1_{ij}^{noobj} (c_i - ĉ_i)²
Loss_obj = λ_obj Σ_{i=0}^{l.w×l.h} Σ_{j=0}^{l.n} 1_{ij}^{obj} (c_i - ĉ_i)²
Loss_class = λ_class Σ_{i=0}^{l.w×l.h} 1_{i}^{obj} Σ_{c∈classes} (p_i(c) - p̂_i(c))²
Loss_coord = λ_coord Σ_{i=0}^{l.w×l.h} Σ_{j=0}^{l.n} 1_{ij}^{obj} t_i [(x_i - x̂_i)² + (y_i - ŷ_i)² + (w_i - ŵ_i)² + (h_i - ĥ_i)²]
t_i = 2 - w_i × h_i
wherein Loss denotes the loss function; Loss_noobj denotes the confidence loss function of prediction boxes not containing a target, Loss_obj the confidence loss function of prediction boxes containing a target, Loss_class the class loss function, and Loss_coord the coordinate loss function; λ_noobj denotes the coefficient corresponding to prediction boxes not containing a target; l.w denotes the division size of the network over the picture width, l.h the division size over the picture height, and i, j the corresponding iteration variables; 1_{ij}^{noobj} is the parameter indicating whether the prediction box does not contain a target, c_i is the confidence of the prediction box, and ĉ_i is the confidence of the labeling box; λ_obj denotes the coefficient corresponding to prediction boxes containing a target, and 1_{ij}^{obj} is the parameter indicating whether the prediction box contains a target; λ_class denotes the coefficient corresponding to prediction boxes containing a target class, c is the iteration variable over classes, classes denotes the set of classes in the dataset, p_i(c) is the probability that the prediction box contains class c, and p̂_i(c) is the probability that the labeling box contains class c; λ_coord denotes the network's weight parameter for the coordinates of the prediction box; w_i denotes the width of the prediction box and ŵ_i the width of the labeling box; h_i denotes the height of the prediction box and ĥ_i the height of the labeling box; x_i denotes the x-axis projection of the upper-left-corner coordinate of the prediction box and x̂_i denotes x_min; y_i denotes the y-axis projection of the upper-left-corner coordinate of the prediction box and ŷ_i denotes y_min.
CN201910202078.4A 2019-03-18 2019-03-18 Pedestrian detection method based on improved k-means and loss function Active CN109978035B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910202078.4A CN109978035B (en) 2019-03-18 2019-03-18 Pedestrian detection method based on improved k-means and loss function


Publications (2)

Publication Number Publication Date
CN109978035A CN109978035A (en) 2019-07-05
CN109978035B true CN109978035B (en) 2021-04-02

Family

ID=67079213





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant