CN108038409B - Pedestrian detection method - Google Patents

Pedestrian detection method

Info

Publication number
CN108038409B
CN108038409B
Authority
CN
China
Prior art keywords
image
scale
feature map
candidate
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711030102.8A
Other languages
Chinese (zh)
Other versions
CN108038409A (en)
Inventor
章东平
胡葵
王都洋
张香伟
杨力
肖刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Gosun Guard Security Service Technology Co ltd
Original Assignee
Jiangxi Gosun Guard Security Service Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Gosun Guard Security Service Technology Co ltd
Priority to CN201711030102.8A
Publication of CN108038409A
Application granted
Publication of CN108038409B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian detection method based on a convolutional neural network. An input image is passed through multiple convolution and pooling operations to extract the features of the original image and obtain the corresponding feature map; the feature maps corresponding to scaled versions of the original image are then approximated through the image feature pyramid rule. Candidate windows are generated by a region proposal network (RPN) and further selected according to the size distribution of pedestrians in the candidate windows, after which the selected candidate windows are aggregated. Labeled training data are used to train the weights corresponding to pedestrian targets of different scales on images of different scales and to train a classifier network. The confidence obtained after the aggregated candidate windows pass through the classifier is compared with a set threshold to make the final pedestrian detection decision. Applying the image feature pyramid avoids the heavy computation of obtaining feature maps by rescaling the image, and detecting on different feature maps with different weights effectively reduces the misjudgment and missed detection of single-feature-map detection.

Description

Pedestrian detection method
Technical Field
The invention relates to a pedestrian detection method, and belongs to the field of target detection.
Background
In recent years, pedestrian detection technology has been widely applied in fields such as intelligent surveillance, autonomous driving, and robot vision. In practical applications, pedestrian detection remains challenging because the pedestrians captured in video vary in size, clothing, and posture. There are two main approaches to pedestrian detection: traditional sliding-window methods and deep-learning methods based on learned features. Traditional methods involve a large amount of computation and, without GPU resources, their detection speed is limited. As computer performance keeps improving and GPU computing power is exploited, deep-learning methods based on learned features are usually faster than traditional methods, but the multi-scale problem of pedestrians remains difficult to solve.
Disclosure of Invention
In order to solve the problems that speed and detection accuracy are difficult to balance in pedestrian detection and that pedestrians appear at multiple scales, the invention provides a pedestrian detection method comprising the following steps:
step (1), determining the current frame image: taking a picture from the test set as the current frame image, or taking a frame to be processed from a video sequence as the current frame image;
step (2), obtaining the feature map: passing the current frame image through several convolution layers and pooling layers, and obtaining a feature map from the last convolution layer;
step (3), feature map expansion: calculating the feature maps corresponding to adjacent scales of the image through the image feature pyramid rule, expanding in turn N small-scale feature maps and N large-scale feature maps, where the number of expansions N and the expansion factors are not limited, so that 2N+1 feature maps are obtained;
step (4), proposal window allocation: generating candidate windows from the feature maps through a region proposal network (RPN), and further selecting the candidate windows according to the pedestrian size distribution;
step (5), training the classification network: training a deep neural network using the distribution of pedestrians of various scales across the different feature maps;
step (6), pedestrian detection and labeling: aggregating, in proportion, the numbers of proposal windows from the obtained feature maps of the three scales, classifying them with the classifier trained in step (5), and framing the pedestrians after non-maximum suppression.
Further, the step (1) is specifically as follows: taking a picture from the test set as the current frame image, or taking a frame to be processed from a video sequence as the current frame image, and recording the current frame image as I_1.
Further, the step (2) is specifically as follows: the current frame image is passed through a plurality of convolution layers and pooling layers, where convolution and pooling layers alternate and the number of layers is not limited; a feature map is obtained from the last convolution layer and is denoted f_1.
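As an illustration of step (2), the following is a minimal sketch of such a feature-extraction backbone. The patent does not fix the number of layers or the channel widths, so the PyTorch layers below (a VGG-like stack of alternating convolutions and poolings) are assumptions chosen only for demonstration.

```python
# Illustrative sketch only: layer counts and channel widths are assumptions,
# not values fixed by the patent.
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Alternating convolution and pooling layers; the last conv output is f1."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                        # /2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                        # /4
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                        # /8
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(inplace=True),  # last conv -> f1
        )

    def forward(self, image):
        return self.features(image)

# Usage: I1 is the current frame, f1 the feature map from the last conv layer.
I1 = torch.randn(1, 3, 384, 1280)   # e.g. a KITTI-sized frame
f1 = Backbone()(I1)                  # shape: (1, 512, 48, 160)
```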
Further, the step (3) is specifically as follows: the feature maps corresponding to scales close to image I_1 are calculated through the image power-law rule and the image feature pyramid rule. In general one would use f_m = C_p(S(I_1, M)), where I_1 denotes the original image, M the zoom size, S the scaling of the image, and C_p the feature computation by the convolution and pooling operations. To reduce convolution operations and increase running speed, the following formula is used instead:

$f_{m'} = S(f_m,\, m'/m)\cdot(m'/m)^{-\alpha}$   (2.1)
where the parameter m represents the current scale, m' the scaled scale, S denotes scaling the feature map by m'/m times, f denotes the feature, and the constant coefficient α can be measured experimentally on the training set. The formula shows that the original image I_m has its features computed by convolution and pooling, while the features of nearby zoomed images are obtained from the known feature map, so the feature maps corresponding to scaled versions of the original image, such as 1/2·I_1 and 2·I_1, can be calculated (the scales of the expanded pictures and the number of expansions are not limited; these two scales are chosen in view of detection speed and presentation). Because the pyramid rule approximates each step with a scaling factor of 2^{-1/4}, the feature map f_{1/2} is obtained after iterating the calculation four times. Because image upsampling incurs no high-frequency loss, the information content of the upsampled picture is similar to that of the low-resolution image, and the feature calculation formula is:

$f_\sigma = \sigma \cdot S(f_1, \sigma)$   (2.2)

where f_1 denotes the feature map of the original image, S denotes magnifying the feature map f_1 by a factor of σ, and f_σ is the feature map of the up-sampled image.
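The following is a minimal sketch of the feature-map expansion in step (3), implementing formula (2.1) for reduced scales and formula (2.2) for enlarged scales. The value of the coefficient α is a placeholder (the patent states it is measured experimentally on the training set), and bilinear resampling for the scaling operator S is also an assumption.

```python
# Sketch only: alpha = 0.15 is a placeholder; bilinear resampling is assumed for S.
import torch
import torch.nn.functional as F

def shrink_feature_map(f_m, m, m_prime, alpha=0.15):
    """Down-scale approximation (2.1): f_{m'} ~= S(f_m, m'/m) * (m'/m)^(-alpha)."""
    ratio = m_prime / m
    resized = F.interpolate(f_m, scale_factor=ratio, mode="bilinear",
                            align_corners=False)
    return resized * (ratio ** -alpha)

def enlarge_feature_map(f_1, sigma):
    """Up-scale approximation (2.2): f_sigma = sigma * S(f_1, sigma)."""
    resized = F.interpolate(f_1, scale_factor=sigma, mode="bilinear",
                            align_corners=False)
    return resized * sigma

# Expand f1 into 2N+1 feature maps (here N = 1, matching the example scales
# 1/2 * I1 and 2 * I1): one reduced-scale and one enlarged-scale map.
f1 = torch.randn(1, 512, 48, 160)
pyramid = [shrink_feature_map(f1, m=1.0, m_prime=0.5), f1,
           enlarge_feature_map(f1, sigma=2.0)]
```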
Further, the step (4) is specifically as follows: because the RPN has a single receptive field, large targets tend to be detected on the feature map corresponding to the reduced-scale image, and small targets tend to be detected on the feature map corresponding to the enlarged-scale image. The pedestrian targets in the image are divided into several scales, and experiments are carried out on the KITTI data set, which contains pedestrians at multiple scales. According to their heights, the pedestrians in the data set are grouped as height < H_1, H_1 ≤ height < H_2, ..., H_{n-1} ≤ height < H_n, height ≥ H_n, where H_1 to H_n are pixel heights from small to large and the numbers of pedestrians at the different scales are A_1, A_2, ..., A_n respectively. Then the pedestrian candidate boxes of each scale on each feature map are selected according to the proportional distribution of candidate boxes in the feature map, taking T_uv of them in turn:
$T_{uv} = Z_u \cdot \dfrac{A_{uv}}{\sum_{v=1}^{n} A_{uv}}$
where T_uv is the number of scale-v pedestrian candidates to be finally extracted on the u-th feature map, Z_u (1 ≤ u ≤ 2N+1) is the total number of candidate windows to be finally extracted on the u-th feature map, determined according to the data set (the same number may be chosen for every feature map, or different numbers may be extracted), and A_uv is the number of scale-v pedestrians on the u-th feature map. Because the proposal network has a single receptive field (the area of the input image that a node on the output feature map responds to), large targets tend to be detected on the feature map of the reduced-scale image and small targets on the feature map of the enlarged-scale image; extracting candidate windows in proportion to targets of different scales therefore helps to exploit the network's detection advantages on the different feature maps.
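A minimal sketch of the proposal quota in step (4) follows: on feature map u, the number of candidate windows kept for pedestrian scale v is proportional to that scale's share of pedestrians on the map. The counts used in the example are made up for illustration.

```python
# Sketch of the per-scale proposal quota T_uv = Z_u * A_uv / sum_v(A_uv).
# The pedestrian counts and window budget below are hypothetical examples.
def allocate_proposals(A_u, Z_u):
    """A_u[v] = number of scale-v pedestrians on feature map u,
    Z_u = total candidate windows to keep on this map.
    Returns T_u[v] = Z_u * A_u[v] / sum(A_u)."""
    total = sum(A_u)
    return [round(Z_u * a / total) for a in A_u]

# Example: three pedestrian scales on the enlarged-image feature map,
# which mostly sees small targets.
A_enlarged = [620, 240, 40]            # hypothetical per-scale pedestrian counts
T_enlarged = allocate_proposals(A_enlarged, Z_u=300)
print(T_enlarged)                      # -> [207, 80, 13]
```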
Further, the step (5) is specifically as follows:
1) A KITTI data set containing pedestrians of various scales is selected for the experiment, and on the training data set the pedestrians are divided into X scales according to height (the number of scale levels is not limited);
2) The RPN (Region Proposal Network) and the softmax classifier network are trained jointly with shared convolutional features, using an alternating training scheme: the RPN is trained first, then the region-based classifier network is trained with the candidate windows, and then the RPN is trained again using the classifier network. The loss layer is the end point of the convolutional neural network (CNN); it accepts two values as input, one being the CNN prediction and the other the ground-truth label. From the predicted value and the label value the loss layer computes the loss function of the current network, generally denoted L(W), where W denotes the current network weights. The purpose of training is to find, in the weight space, the weights W(opt) that minimize the loss function L(W); W(opt) can be approximated by stochastic gradient descent. The network has two loss functions, a classification loss and a regression loss;
3) Because the structure in step (3) is changed, the loss function is adapted accordingly; the parameter to be trained and optimized is W, and the training set is defined as

$\{(M_i, y_i, B_i)\}_{i=1}^{N}$

where M_i is an image block of interest sampled from the training set, N is the total number of training samples, y_i ∈ {0, 1} is the class label of M_i, and B_i = (m'/m)·(b_i^x, b_i^y, b_i^w, b_i^h) are the bounding-box coordinates mapped to the feature map, where b_i^x, b_i^y, b_i^w, b_i^h are the coordinates of the image block on the original image and (m'/m) is the scaling factor explained in step (3);
4) The multitask loss function is thus:

$L(W) = \sum_{x=1}^{n} \frac{A_x}{\sum_{j=1}^{n} A_j}\; \mathbb{E}_{i \in E_x}\big[\, l(M_i, (y_i, B_i) \mid W)\,\big]$

where n is the number of scale levels of the target size, E_x denotes the data samples of scale x, M_i is an image block of interest sampled from the training set, A_1, A_2, ..., A_n are the numbers of pedestrians at the n scales, and l is the joint loss function of classification and regression, defined as:
$l(M, (y, B) \mid W) = L_{cls}(p(M), y) + \beta[y \ge 1]\, L_{loc}(T^y, B)$   (2.5)

where β is a trade-off coefficient, T^y is the predicted box position for class y, [y ≥ 1] indicates that the regression loss applies only to positive samples, and L_cls and L_loc are the cross-entropy loss and the bounding-box regression loss respectively, defined as:
$L_{cls}(p(M), y) = -\log p_y(M), \qquad L_{loc}(T^y, B) = \sum_{j \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\!\big(T_j^y - B_j\big)$

where p_y(M) is the predicted probability of class y, with p(M) = (p_0(M), p_1(M)); y ∈ {0, 1} is the class label of M; T_i^y = (t_i^x, t_i^y, t_i^w, t_i^h) is the predicted box position; and B_i = (m'/m)·(b_i^x, b_i^y, b_i^w, b_i^h) are the bounding-box coordinates corresponding to the feature map.
5) In 4), the prediction probability p and the predicted box T are obtained by multiplying the feature vector of each proposal with its respective weight vector, so the joint parameters of the classification and regression processes can be continuously adjusted according to the predicted values and the labels through the above formula until the loss function L(W) is minimized and the jointly optimal parameters W(w_cls, w_loc) are obtained, i.e.

$W(w_{cls}, w_{loc}) = \arg\min_W \, L(W) + \phi\,\|W\|$

where L(W) is the multitask loss function and φ is the regularization parameter.
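The following is a sketch of the training loss in step (5). It assumes the standard Fast R-CNN forms for the two terms, cross-entropy for classification and smooth-L1 for box regression, and weights the per-scale terms by the pedestrian counts A_1, ..., A_n; both of these choices are our reading of the description above rather than details fixed by the patent.

```python
# Sketch only: the smooth-L1 regression term and the proportional per-scale
# weighting are assumptions consistent with, but not confirmed by, the text.
import torch
import torch.nn.functional as F

def joint_loss(cls_logits, labels, box_pred, box_target, beta=1.0):
    """l = L_cls(p(M), y) + beta * [y >= 1] * L_loc(T^y, B)."""
    cls_loss = F.cross_entropy(cls_logits, labels)
    positive = labels >= 1                      # regression loss only for positives
    if positive.any():
        loc_loss = F.smooth_l1_loss(box_pred[positive], box_target[positive])
    else:
        loc_loss = box_pred.sum() * 0.0
    return cls_loss + beta * loc_loss

def multiscale_loss(per_scale_batches, A):
    """L(W) = sum_x  A_x / sum(A) * (average joint loss over the samples of scale x)."""
    total_A = sum(A)
    loss = 0.0
    for (cls_logits, labels, box_pred, box_target), A_x in zip(per_scale_batches, A):
        loss = loss + (A_x / total_A) * joint_loss(cls_logits, labels,
                                                   box_pred, box_target)
    return loss
```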
Further, the step (6) is specifically as follows: the J proposal windows from step (4) are aggregated, passed through the region-of-interest pooling layer and the fully connected layers to obtain fixed-size feature maps, classified by the classifier trained in step (5), and the windows that overlap the maximum-confidence window by more than 65% are removed by non-maximum suppression.
The invention discloses a pedestrian detection method based on a convolutional neural network in which the feature maps corresponding to scales adjacent to the original image are calculated through the image feature pyramid rule, avoiding the heavy computation of obtaining feature maps by rescaling the image, while detecting on different feature maps with different weights effectively reduces the misjudgment and missed detection of single-feature-map detection. An effective trade-off between pedestrian detection speed and accuracy is achieved.
Drawings
FIG. 1 is a flow chart of a convolutional neural network-based pedestrian detection method of the present invention;
FIG. 2 is a diagram of the candidate window (proposal) optimization selection algorithm of the pedestrian detection method of the present invention;
FIG. 3 is a schematic diagram of a non-maxima suppression implementation;
FIG. 4 is an effect diagram of the pedestrian detection method of the present application on a KITTI data set picture.
Detailed Description
The invention will be further explained with reference to the drawings.
As shown in FIG. 1, the pedestrian detection method based on a convolutional neural network of the present invention includes the following steps:
step (1), determining the current frame image: taking a picture from the test set as the current frame image, or taking a frame to be processed from a video sequence as the current frame image;
step (2), obtaining the feature map: passing the current frame image through several convolution layers and pooling layers, and obtaining a feature map from the last convolution layer;
step (3), feature map expansion: calculating the feature maps corresponding to scales close to the image through the image power-law rule and the image feature pyramid rule, where the scales of the expanded pictures and the number of expansions are not limited;
step (4), proposal window allocation: selecting a suitable pedestrian data set, or building a pedestrian data set with variable scales, dividing the targets in the picture into small, medium and large scales (the number of scale levels is determined by the pedestrian scales of the data set), and allocating the number of proposal windows according to the proportion of targets of each scale in the picture;
step (5), training the classification network: training a deep neural network using the distribution of pedestrians of various scales across the different feature maps;
step (6), pedestrian detection and labeling: aggregating, in proportion, the numbers of candidate windows from the obtained feature maps of the three scales, classifying them with the classifier trained in step (5), and framing the pedestrians after non-maximum suppression.
The step (1) of determining the current frame image is as follows: taking a picture from the test set as the current frame image, or taking a frame to be processed from a video sequence as the current frame image, and recording the current frame image as I_1.
The step (2) of obtaining the feature map is as follows: passing the current frame image through multiple convolution layers and pooling layers, where convolution and pooling layers alternate without limit on their number, obtaining a feature map from the last convolution layer, and denoting it f_1.
The step (3) of expanding the feature map is as follows: the feature maps corresponding to scales close to image I_1 are calculated through the image power-law rule and the image feature pyramid rule. In general one would use f_m = C_p(S(I_1, M)), where I_1 denotes the original image, M the zoom size, S the scaling of the image, and C_p the feature computation by the convolution and pooling operations. To reduce convolution operations and increase running speed, the following formula is used instead:

$f_{m'} = S(f_m,\, m'/m)\cdot(m'/m)^{-\alpha}$   (3.1)
where the parameter m represents the current scale, m' the scaled scale, S denotes scaling the feature map by m'/m times, f denotes the feature, and the constant coefficient α can be measured experimentally on the training set. The formula shows that the original image I_m has its features computed by convolution and pooling, while the features of nearby zoomed images are approximated from the known feature map; for example, f_{1/2} can be calculated for 1/2·I_1. Because image upsampling incurs no high-frequency loss, the information content of the upsampled picture is similar to that of the low-resolution image, and the feature calculation formula is:

$f_\sigma = \sigma \cdot S(f_1, \sigma)$   (3.2)

where f_1 denotes the feature map of the original image, S denotes magnifying the feature map f_1 by a factor of σ, and f_σ is the feature map of the up-sampled image.
The step (4) of allocating the proposal windows is as follows: because the RPN has a single receptive field, large targets tend to be detected on the feature map corresponding to the reduced-scale image and small targets on the feature map corresponding to the enlarged-scale image. The targets in the picture are divided into three scales, and experiments are carried out on the KITTI data set, which contains pedestrians at multiple scales. According to their heights, the pedestrians in the data set are grouped into the three scales height < H_1, H_1 ≤ height < H_2, height ≥ H_2, where H_1 and H_2 are 50 and 200 pixels respectively, and the numbers of pedestrians at the different scales are A_1, A_2, A_3. Then, on each feature map, Z_K·A_1/(A_1+A_2+A_3) candidate windows with pedestrian height below 50 pixels are selected by confidence, Z_K·A_2/(A_1+A_2+A_3) candidate windows with pedestrian height between 50 and 200 pixels are selected by confidence, and Z_K·A_3/(A_1+A_2+A_3) candidate windows with pedestrian height above 200 pixels are selected by confidence, where A_1, A_2, A_3 denote the numbers of pedestrian candidate windows extracted at the three scales, K = 1, 2, 3 denote the reduced feature map, the original feature map and the enlarged feature map respectively, and Z_K is the number of candidate windows to be extracted from each feature map. As shown in FIG. 2, f_1, f_{1/2}, f_2 denote the feature map of image I_1 obtained from the last convolution layer and the feature maps obtained by the expansion calculation, and the numbers of candidate windows selected from them are Z_1, Z_2, Z_3 respectively. Because pedestrian detection on the different feature maps is biased toward different scales, allocating the candidate windows in proportion to the different target scales helps to exploit the network's detection advantages on the different feature maps.
The step of training the classification network in the step (5) comprises the following steps:
1) A KITTI data set containing pedestrians of various scales is selected for the experiment, and on the training data set the pedestrians are divided into X scales according to height (the number of scale levels is not limited);
2) The RPN (Region Proposal Network) and the softmax classifier network are trained jointly with shared convolutional features, using an alternating training scheme: the region proposal network is trained first, then the region-based classifier network is trained with the proposals, and finally the region proposal network is trained again using the classifier network. The loss layer is the end point of the convolutional neural network (CNN); it accepts two values as input, one being the CNN prediction and the other the ground-truth label. From these two inputs the loss layer computes the loss function of the current network, generally denoted L(W), where W denotes the current network weights. The purpose of training is to find, in the weight space, the weights W(opt) that minimize the loss function L(W); W(opt) can be approximated by stochastic gradient descent. The network has two loss functions, a classification loss and a regression loss;
3) Because the structure in step (3) is changed, the loss function is adapted accordingly; the parameter to be trained and optimized is W, and the training set is defined as

$\{(M_i, y_i, B_i)\}_{i=1}^{N}$

where M_i is an image block of interest sampled from the training set, N is the total number of training samples, y_i ∈ {0, 1} is the class label of M_i, and B_i = (m'/m)·(b_i^x, b_i^y, b_i^w, b_i^h) are the bounding-box coordinates mapped to the feature map, where b_i^x, b_i^y, b_i^w, b_i^h are the coordinates of the image block on the original image and (m'/m) is the scaling factor explained in step (3);
4) The multitask loss function is thus:

$L(W) = \sum_{x=1}^{n} \frac{A_x}{\sum_{j=1}^{n} A_j}\; \mathbb{E}_{i \in E_x}\big[\, l(M_i, (y_i, B_i) \mid W)\,\big]$

where n is the number of scale levels of the target size, E_x denotes the data samples of scale x, M_i is an image block of interest sampled from the training set, A_1, A_2, ..., A_n are the numbers of pedestrians at the n scales, and l is the joint loss function of classification and regression, defined as:
$l(M, (y, B) \mid W) = L_{cls}(p(M), y) + \beta[y \ge 1]\, L_{loc}(T^y, B)$   (3.4)

where β is a trade-off coefficient, T^y is the predicted box position for class y, [y ≥ 1] indicates that the regression loss applies only to positive samples, and L_cls and L_loc are the cross-entropy loss and the bounding-box regression loss respectively, defined as:
$L_{cls}(p(M), y) = -\log p_y(M), \qquad L_{loc}(T^y, B) = \sum_{j \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\!\big(T_j^y - B_j\big)$

where p_y(M) is the predicted probability of class y, with p(M) = (p_0(M), p_1(M)); y ∈ {0, 1} is the class label of M; T_i^y = (t_i^x, t_i^y, t_i^w, t_i^h) is the predicted box position; and B_i = (m'/m)·(b_i^x, b_i^y, b_i^w, b_i^h) are the bounding-box coordinates corresponding to the feature map.
5) In step 4), the prediction probability p and the predicted box T are obtained by multiplying the feature vector of each proposal with its respective weight vector, so the joint parameters of the classification and regression processes can be continuously adjusted according to the predicted values and the labels through the above formula until the loss function L(W) is minimized and the jointly optimal parameters W(w_cls, w_loc) are obtained, i.e. W(w_cls, w_loc) = argmin_W L(W) + φ‖W‖, where L(W) is the multitask loss function and φ is the regularization parameter.
The step (6) of pedestrian detection and labeling is as follows: the J proposal windows from step (4) are aggregated, passed through the region-of-interest pooling layer and the fully connected layers so that the input feature map has a fixed size, and classified by the trained classifier to obtain the confidence of each pedestrian candidate; the candidates of each scale are multiplied by the corresponding weight l_x obtained in the training of step (5). Non-maximum suppression then removes windows that overlap the maximum-confidence window by more than 65%. As shown in FIG. 3, S_1 and S_3 denote the areas of two detection boxes and S_2 their overlap area; the intersection-over-union ratio is S_2/(S_1+S_3-S_2), and if it is greater than the threshold 0.65 the box with the lower confidence is discarded. FIG. 4 shows the effect of the pedestrian detection method of the present application on a picture from the KITTI data set; it can be seen that many pedestrians less than 50 pixels tall are detected, which demonstrates the feasibility and detection advantages of the pedestrian detection method.
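A minimal sketch of the non-maximum suppression described above follows, using the overlap ratio S_2/(S_1+S_3-S_2) of FIG. 3 and the 0.65 threshold. The box coordinates in the example are hypothetical.

```python
# Sketch of greedy non-maximum suppression: boxes are processed in descending
# confidence order, and any box whose overlap ratio S2 / (S1 + S3 - S2) with a
# kept box exceeds 0.65 is discarded.
def iou(a, b):
    """Boxes as (x1, y1, x2, y2); returns S2 / (S1 + S3 - S2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    s2 = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)          # overlap area
    s1 = (a[2] - a[0]) * (a[3] - a[1])
    s3 = (b[2] - b[0]) * (b[3] - b[1])
    return s2 / (s1 + s3 - s2)

def nms(boxes, scores, iou_thresh=0.65):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

# Example: two overlapping detections of the same pedestrian plus one separate box.
boxes  = [(100, 80, 150, 220), (105, 85, 155, 225), (300, 90, 340, 200)]
scores = [0.92, 0.88, 0.81]
print(nms(boxes, scores))   # -> [0, 2]; the second box overlaps box 0 by > 65%
```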

Claims (3)

1. A pedestrian detection method, characterized by comprising the steps of:
step (1), determining the current frame image: taking a picture from the test set as the current frame image, or taking a frame to be processed from a video sequence as the current frame image, and recording the current frame image as I_1;
step (2), calculating the feature map: passing the current frame image through a plurality of convolution layers and pooling layers, where convolution and pooling layers alternate and the number of layers is not limited, and obtaining a feature map from the last convolution layer, denoted f_1;
step (3), feature map expansion: calculating the feature maps corresponding to adjacent scales of the image through the image feature pyramid rule, expanding in turn N small-scale feature maps and N large-scale feature maps, where the number of expansions N and the expansion factors are not limited, so that 2N+1 feature maps are obtained;
step (4), candidate window extraction: generating candidate windows from the feature maps through a region proposal network (RPN), and further selecting the candidate windows according to the pedestrian size distribution;
step (5), training the classifier: training a deep neural network using the distribution of pedestrians of various scales across the different feature maps;
step (6), pedestrian detection output: aggregating the candidate windows obtained from the multi-scale feature maps, classifying them with the trained classifier, and framing the pedestrians after non-maximum suppression;
the step (3) is specifically as follows: computing an image I1Feature maps corresponding to close-in scales, using fm=Cp(S(I1,M)),
In the formula I1Representing the original image, M the zoom size, S the zoom of the image, CpRepresenting the calculation characteristics of convolution pooling operation, improving the calculation speed for reducing the convolution operation, calculating a characteristic graph corresponding to the proximity gauge image according to the image characteristic pyramid rule, wherein the calculation formula is as follows:
Figure FDA0003247919020000011
where the parameter m represents the current scale, m' the scaled scale, S denotes scaling the feature map by m'/m times, f denotes the feature, and the constant coefficient α can be measured experimentally on the training set; the formula above shows that the original image I_m has its features computed by convolution and pooling, while the features of images at nearby scales are approximated from the known feature map, so f_{1/2} can be calculated for 1/2·I_1; because image upsampling incurs no high-frequency loss, the information content of the upsampled picture is similar to that of the low-resolution image, and the feature calculation formula is:

$f_\sigma = \sigma \cdot S(f_1, \sigma)$   (1.2)

where f_1 denotes the feature map of the original image, S denotes magnifying the feature map f_1 by a factor of σ, and f_σ is the feature map of the up-sampled image;
the step (4) is specifically as follows: respectively generating candidate proposing windows by the characteristic graph through an RPN network, and setting the pedestrian scale as height according to the height of the pedestrian in the candidate window<H1,H1≤height<H2,...,Hn-1≤height<Hn,height≥HnHere H1To HnThe number of the pixel points from small to large is A corresponding to the number of pedestrians with different scales respectively1,A2,...,An(ii) a Then, selecting T for the pedestrian candidate frame of each scale on each feature map according to the proportion distribution of the candidate frames in the feature mapuvSequentially selecting the number TuvThe candidate window is:
Figure FDA0003247919020000021
in the formula TuvIs the number of the pedestrians with the v scale on the u characteristic diagram which needs to be extracted finally, ZuIs the sum of the candidate windows to be extracted finally on the u-th feature map, N is more than or equal to 1 and less than or equal to 2N +1, and the same number of candidate windows or different numbers of candidate windows can be selected from each feature map according to the data set condition, wherein A isuvThe number of the pedestrians in the v scale on the u feature map is shown.
2. The pedestrian detection method according to claim 1, characterized in that: the step (5) is specifically as follows:
1) selecting a KITTI data set with various pedestrian scales for experiment, and dividing the pedestrians into n scales of pedestrians according to the height on a training data set;
2) training a deep neural network using the training set of the KITTI data set, wherein the loss layer of the convolutional neural network (CNN) receives two values as input, one being the predicted value of the CNN and the other the real label; the loss layer performs a series of operations on the predicted value and the label value to obtain the loss function of the current network, denoted L(W), where W denotes the current network weights;
3) the loss function is adapted accordingly; the parameter to be trained and optimized is W, and the training set is defined as

$\{(M_i, y_i, B_i)\}_{i=1}^{N}$

where M_i is the image block of interest sampled from the training set, N is the total number of training samples, y_i ∈ {0, 1} is the class label of M_i, and B_i = (m'/m)·(b_i^x, b_i^y, b_i^w, b_i^h) are the bounding-box coordinates corresponding to the feature map, where b_i^x, b_i^y, b_i^w, b_i^h are the coordinates of the image block on the original image, (m'/m) is the scaling factor, m is the current scale and m' the scaled scale;
4) the multitask loss function is:

$L(W) = \sum_{x=1}^{n} \frac{A_x}{\sum_{j=1}^{n} A_j}\; \mathbb{E}_{i \in E_x}\big[\, l_x(M_i, (y_i, B_i) \mid W)\,\big]$

where n is the number of scale levels of the target size, E_x denotes the data samples of scale x, M_i is an image block of interest sampled from the training set, A_1, A_2, ..., A_n are the numbers of pedestrians at the n scales, and l_x is the joint loss function of classification and regression corresponding to each scale, defined as:
$l(M, (y, B) \mid W) = L_{cls}(p(M), y) + \beta[y \ge 1]\, L_{loc}(T^y, B)$   (1.5)

where β is a trade-off coefficient, T^y is the predicted box position for class y, [y ≥ 1] indicates that the regression loss applies only to positive samples, and L_cls and L_loc are the cross-entropy loss and the bounding-box regression loss respectively, defined as:
$L_{cls}(p(M), y) = -\log p_y(M), \qquad L_{loc}(T^y, B) = \sum_{j \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\!\big(T_j^y - B_j\big)$

where p_y(M) is the predicted probability of class y, with p(M) = (p_0(M), p_1(M)); y ∈ {0, 1} is the class label of M; T_i^y = (t_i^x, t_i^y, t_i^w, t_i^h) is the predicted box position; and B_i = (m'/m)·(b_i^x, b_i^y, b_i^w, b_i^h) are the bounding-box coordinates corresponding to the feature map;
5) because the prediction probability p and the predicted box T in step 4) are each obtained by multiplying the feature vector of a proposal with its respective weight vector, the joint parameters of the classification and regression processes can be continuously adjusted according to the predicted values and the labels through the above formula, so that the loss function L(W) is minimized and the jointly optimal parameters W(w_cls, w_loc) are obtained, i.e. W(w_cls, w_loc) = argmin_W L(W) + φ‖W‖, where L(W) is the multitask loss function and φ is the regularization parameter.
3. The pedestrian detection method according to claim 1, characterized in that the step (6) is specifically as follows: the J proposal windows from step (4) are aggregated, the input feature map is brought to a fixed size through the fully connected layer, and the trained classifier is used for classification to obtain the confidence of each pedestrian candidate; a candidate is judged to be a pedestrian if the result is greater than 0.75, and the framed pedestrians then pass through non-maximum suppression, which removes windows overlapping the maximum-confidence window by more than 65%.
CN201711030102.8A 2017-10-27 2017-10-27 Pedestrian detection method Active CN108038409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711030102.8A CN108038409B (en) 2017-10-27 2017-10-27 Pedestrian detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711030102.8A CN108038409B (en) 2017-10-27 2017-10-27 Pedestrian detection method

Publications (2)

Publication Number Publication Date
CN108038409A CN108038409A (en) 2018-05-15
CN108038409B true CN108038409B (en) 2021-12-28

Family

ID=62093419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711030102.8A Active CN108038409B (en) 2017-10-27 2017-10-27 Pedestrian detection method

Country Status (1)

Country Link
CN (1) CN108038409B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110647897B (en) * 2018-06-26 2023-04-18 广东工业大学 Zero sample image classification and identification method based on multi-part attention mechanism
CN109117717A (en) * 2018-06-29 2019-01-01 广州烽火众智数字技术有限公司 A kind of city pedestrian detection method
CN110659658B (en) * 2018-06-29 2022-07-29 杭州海康威视数字技术股份有限公司 Target detection method and device
CN109003223B (en) * 2018-07-13 2020-02-28 北京字节跳动网络技术有限公司 Picture processing method and device
CN109284670B (en) * 2018-08-01 2020-09-25 清华大学 Pedestrian detection method and device based on multi-scale attention mechanism
CN109101915B (en) * 2018-08-01 2021-04-27 中国计量大学 Face, pedestrian and attribute recognition network structure design method based on deep learning
CN109284669A (en) * 2018-08-01 2019-01-29 辽宁工业大学 Pedestrian detection method based on Mask RCNN
CN109255352B (en) * 2018-09-07 2021-06-22 北京旷视科技有限公司 Target detection method, device and system
CN109242801B (en) * 2018-09-26 2021-07-02 北京字节跳动网络技术有限公司 Image processing method and device
CN109492596B (en) * 2018-11-19 2022-03-29 南京信息工程大学 Pedestrian detection method and system based on K-means clustering and regional recommendation network
CN109658412B (en) * 2018-11-30 2021-03-30 湖南视比特机器人有限公司 Rapid packaging box identification and segmentation method for unstacking and sorting
CN109800637A (en) * 2018-12-14 2019-05-24 中国科学院深圳先进技术研究院 A kind of remote sensing image small target detecting method
CN109829421B (en) * 2019-01-29 2020-09-08 西安邮电大学 Method and device for vehicle detection and computer readable storage medium
CN109858451B (en) * 2019-02-14 2020-10-23 清华大学深圳研究生院 Non-matching hand detection method
CN110059544B (en) * 2019-03-07 2021-03-26 华中科技大学 Pedestrian detection method and system based on road scene
CN110097050B (en) * 2019-04-03 2024-03-08 平安科技(深圳)有限公司 Pedestrian detection method, device, computer equipment and storage medium
CN110136097A (en) * 2019-04-10 2019-08-16 南方电网科学研究院有限责任公司 One kind being based on the pyramidal insulator breakdown recognition methods of feature and device
CN110211097B (en) * 2019-05-14 2021-06-08 河海大学 Crack image detection method based on fast R-CNN parameter migration
CN110263712B (en) * 2019-06-20 2021-02-23 江南大学 Coarse and fine pedestrian detection method based on region candidates
CN110490058B (en) * 2019-07-09 2022-07-26 北京迈格威科技有限公司 Training method, device and system of pedestrian detection model and computer readable medium
CN112307826A (en) * 2019-07-30 2021-02-02 华为技术有限公司 Pedestrian detection method, device, computer-readable storage medium and chip
CN110443366B (en) * 2019-07-30 2022-08-30 上海商汤智能科技有限公司 Neural network optimization method and device, and target detection method and device
CN110648322B (en) * 2019-09-25 2023-08-15 杭州智团信息技术有限公司 Cervical abnormal cell detection method and system
CN111339967B (en) * 2020-02-28 2023-04-07 长安大学 Pedestrian detection method based on multi-view graph convolution network
CN111523494A (en) * 2020-04-27 2020-08-11 天津中科智能识别产业技术研究院有限公司 Human body image detection method
CN111832383B (en) * 2020-05-08 2023-12-08 北京嘀嘀无限科技发展有限公司 Training method of gesture key point recognition model, gesture recognition method and device
CN111681243B (en) * 2020-08-17 2021-02-26 广东利元亨智能装备股份有限公司 Welding image processing method and device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750708A (en) * 2012-05-11 2012-10-24 天津大学 Affine motion target tracing algorithm based on fast robust feature matching
CN103247059A (en) * 2013-05-27 2013-08-14 北京师范大学 Remote sensing image region of interest detection method based on integer wavelets and visual features
CN104850844A (en) * 2015-05-27 2015-08-19 成都新舟锐视科技有限公司 Pedestrian detection method based on rapid construction of image characteristic pyramid
CN105678231A (en) * 2015-12-30 2016-06-15 中通服公众信息产业股份有限公司 Pedestrian image detection method based on sparse coding and neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016054779A1 (en) * 2014-10-09 2016-04-14 Microsoft Technology Licensing, Llc Spatial pyramid pooling networks for image processing

Also Published As

Publication number Publication date
CN108038409A (en) 2018-05-15

Similar Documents

Publication Publication Date Title
CN108038409B (en) Pedestrian detection method
CN111209810B (en) Boundary frame segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time through visible light and infrared images
WO2019144575A1 (en) Fast pedestrian detection method and device
CN109284670B (en) Pedestrian detection method and device based on multi-scale attention mechanism
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN112733822B (en) End-to-end text detection and identification method
CN108062525B (en) Deep learning hand detection method based on hand region prediction
CN111179217A (en) Attention mechanism-based remote sensing image multi-scale target detection method
CN107545263B (en) Object detection method and device
CN106951830B (en) Image scene multi-object marking method based on prior condition constraint
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN111860439A (en) Unmanned aerial vehicle inspection image defect detection method, system and equipment
CN109377511B (en) Moving target tracking method based on sample combination and depth detection network
US10755146B2 (en) Network architecture for generating a labeled overhead image
CN110263877B (en) Scene character detection method
CN112991269A (en) Identification and classification method for lung CT image
CN113139543A (en) Training method of target object detection model, target object detection method and device
Zhao et al. Multiscale object detection in high-resolution remote sensing images via rotation invariant deep features driven by channel attention
CN107767416A (en) Method for recognizing pedestrian's orientation in a low-resolution image
WO2023116632A1 (en) Video instance segmentation method and apparatus based on spatio-temporal memory information
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN115620081B (en) Training method of target detection model and target detection method and device
CN115147418B (en) Compression training method and device for defect detection model
CN115240119A (en) Pedestrian small target detection method in video monitoring based on deep learning
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant