CN108038409A - Pedestrian detection method - Google Patents

Pedestrian detection method

Info

Publication number
CN108038409A
CN108038409A
Authority
CN
China
Prior art keywords
image
scale
pedestrian
feature map
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711030102.8A
Other languages
Chinese (zh)
Other versions
CN108038409B (en)
Inventor
章东平
胡葵
王都洋
张香伟
杨力
肖刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Gao Chuan Security Service Technology Co Ltd
Original Assignee
Jiangxi Gao Chuan Security Service Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Gao Chuan Security Service Technology Co Ltd filed Critical Jiangxi Gao Chuan Security Service Technology Co Ltd
Priority to CN201711030102.8A priority Critical patent/CN108038409B/en
Publication of CN108038409A publication Critical patent/CN108038409A/en
Application granted granted Critical
Publication of CN108038409B publication Critical patent/CN108038409B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

The invention discloses a pedestrian detection method based on a convolutional neural network. Multiple convolution and pooling operations are applied to the input image to extract the features of the original image and obtain its feature map. The feature maps corresponding to scaled versions of the original image are then approximated by the image feature pyramid rule. Each feature map passes through a region proposal network (RPN) to generate candidate windows, which are further selected and pooled according to the distribution of pedestrian sizes. Using labeled training data, a classifier network is trained so that pedestrian targets of different scales carry corresponding weights on images of different scales. The confidence that each pooled candidate window obtains from the classifier is compared with a set threshold to make the final pedestrian detection decision. The application of the image feature pyramid avoids the heavy computation of rescaling the image and recomputing the feature maps, and detecting with different weights on different feature maps effectively avoids the false positives and missed detections of single-feature-map detection.

Description

Pedestrian detection method
Technical Field
The invention relates to a pedestrian detection method, and belongs to the field of target detection.
Background
In recent years, pedestrian detection technology has been widely applied in fields such as intelligent surveillance, autonomous driving, and robot vision. In practical applications, pedestrian detection is very challenging because pedestrians captured in video vary in size, clothing, and posture. There are two main approaches to pedestrian detection: the traditional approach based on sliding windows, and the approach based on deep learning and feature extraction. Traditional methods are computationally heavy and, without GPU resources, limited in detection speed. As computer performance has continuously improved and GPU computing has become available, most deep learning methods based on learned features now exceed traditional methods in detection speed, but the multi-scale pedestrian problem remains difficult to solve.
Disclosure of Invention
In order to solve the problems that speed and detection accuracy are difficult to balance in pedestrian detection and that pedestrians appear at multiple scales, the invention provides a pedestrian detection method, which comprises the following steps:
Step (1) determining a current frame image: taking a picture in the test set, or a frame to be processed in a video sequence, as the current frame image;
Step (2) obtaining a feature map: passing the current frame image through a number of convolution layers and pooling layers, and obtaining a feature map from the last convolution layer;
Step (3) feature map expansion: calculating the feature maps corresponding to adjacent scales of the image through the image feature pyramid rule, expanding in turn N small-scale and N large-scale feature maps (the number of expansions N and the expansion factors are not limited), so that 2N + 1 feature maps are obtained;
Step (4) proposal window allocation: generating candidate windows from each feature map through a region proposal network (RPN), and further selecting the candidate windows according to the pedestrian size distribution;
Step (5) classification network training: training a deep neural network using the distribution of pedestrians of various scales across the different feature maps;
Step (6) pedestrian detection and labeling: pooling, in proportion, the proposal windows from the obtained multi-scale feature maps, classifying them with the classifier trained in step (5), and framing the pedestrians after non-maximum suppression.
Further, the step (1) is specifically as follows: a picture in the test set, or a frame to be processed in a video sequence, is taken as the current frame image and denoted $I_1$.
Further, the step (2) is specifically as follows: the current frame image is passed through multiple convolution layers and pooling layers, which may be interleaved in any order; the last convolution layer yields a feature map, denoted $f_1$.
Further, the step (3) is specifically as follows: the feature maps corresponding to scales adjacent to image $I_1$ are calculated through the image power-law rule and the image feature pyramid rule. Ordinarily one would use $f_M = C_p(S(I_1, M))$, where $I_1$ is the original image, $M$ is the zoom factor, $S$ denotes scaling the image, and $C_p$ denotes computing features by the convolution-pooling operation. To reduce convolution operations and increase running speed, the following formula is used instead:

$$f_{m'} = S(f_m, m'/m) \cdot (m'/m)^{-\alpha} \qquad (2.1)$$
where the parameter $m$ denotes the current scale, $m'$ the target scale, $S$ rescaling the feature map by a factor of $m'/m$, and $f$ the feature map; the constant coefficient $\alpha$ can be measured experimentally on the training set. The formula states that only the original image $I_m$ has its features computed by the convolution-pooling operation, while the image features at nearby zoom scales are obtained from the known feature map. The feature maps corresponding to, for example, $1/2 \times I_1$ and $2 \times I_1$ are calculated (the scales of the expanded pictures and the number of expansions are not limited; these two scales are chosen as a convenient trade-off between detection speed and expressiveness). Because the pyramid rule computes each adjacent scale by multiplying by a fixed factor (e.g. $2^{-1/4}$), iterating the calculation four times yields the corresponding feature map $f_{1/2}$. Because image upsampling incurs no high-frequency loss, the information content of an upsampled picture is similar to that of the low-resolution original, and the feature calculation formula is:
$$f_\sigma = \sigma \cdot S(f_1, \sigma) \qquad (2.2)$$
where $f_1$ is the feature map of the original image, $S$ denotes magnifying the feature map $f_1$ by a factor of $\sigma$, and $f_\sigma$ is the up-sampled feature map.
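As a concrete illustration of formulas (2.1) and (2.2), the following is a minimal Python sketch of the feature-map approximation, assuming a feature map already computed by the convolution-pooling layers; the function names, the value of α (the patent measures it experimentally on the training set), and the per-step factor $2^{1/4}$ are illustrative assumptions rather than the patent's reference implementation.

```python
import numpy as np
from scipy.ndimage import zoom  # order-1 spline resampling stands in for S

ALPHA = 0.1  # assumed value; the patent measures alpha on the training set

def approx_feature_map(f_m: np.ndarray, m: float, m_prime: float,
                       alpha: float = ALPHA) -> np.ndarray:
    """Approximate the feature map at scale m' from the known map at scale m:
    f_{m'} = S(f_m, m'/m) * (m'/m)**(-alpha)   -- formula (2.1).
    f_m has shape (channels, height, width)."""
    ratio = m_prime / m
    resized = zoom(f_m, (1.0, ratio, ratio), order=1)  # S: rescale by m'/m
    return resized * ratio ** (-alpha)

def build_feature_pyramid(f_1: np.ndarray, n_steps: int = 4) -> dict:
    """Iterate the adjacent-scale rule with factor 2**(1/4), four times in
    each direction, to reach f_{1/2} and f_2 without re-running convolutions."""
    pyramid = {1.0: f_1}
    scale, f = 1.0, f_1
    for _ in range(n_steps):               # shrink toward f_{1/2}
        new_scale = scale * 2 ** (-0.25)
        f = approx_feature_map(f, scale, new_scale)
        scale = new_scale
    pyramid[0.5] = f
    scale, f = 1.0, f_1
    for _ in range(n_steps):               # enlarge toward f_2
        new_scale = scale * 2 ** 0.25
        f = approx_feature_map(f, scale, new_scale)
        scale = new_scale
    pyramid[2.0] = f
    return pyramid
```

Under these assumptions, `build_feature_pyramid` would be called once per frame on the feature map from the last convolution layer, yielding the 2N + 1 maps of step (3) for N = 1.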
Further, the step (4) is specifically as follows: because the RPN has a single receptive field (the region of the input image to which a given node on the output feature map responds), large targets tend to be detected on the feature map of a reduced-scale image, and small targets on the feature map of an enlarged-scale image. The pedestrian targets in the image are therefore divided into scales, and an experiment is carried out on the KITTI dataset, which contains pedestrians at multiple scales. Pedestrians in the dataset are binned by height as height $< H_1$, $H_1 \le$ height $< H_2$, ..., $H_{n-1} \le$ height $< H_n$, height $\ge H_n$, where $H_1$ to $H_n$ are pixel heights ordered from small to large and the numbers of pedestrians at the different scales are $A_1, A_2, \ldots, A_n$. Then, on each feature map, the pedestrian candidate frames of each scale are selected according to the proportional distribution of the candidate frames in that feature map; the numbers $T_{uv}$ selected in turn are:

$$T_{uv} = Z_u \cdot \frac{A_{uv}}{\sum_{v'=1}^{n} A_{uv'}} \qquad (2.3)$$

where $T_{uv}$ is the number of candidate windows of the $v$-th pedestrian scale to be finally extracted on the $u$-th feature map, $Z_u$ ($1 \le u \le 2N+1$) is the total number of candidate windows to be extracted on the $u$-th feature map (chosen according to the dataset; the same or different totals may be used for each feature map), and $A_{uv}$ is the number of pedestrians of the $v$-th scale on the $u$-th feature map. Because the proposal-window network has a single receptive field, extracting candidate windows in proportion to the targets of different scales helps the network exploit its detection advantages on the different feature maps.
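The allocation of formula (2.3) can be sketched as follows, assuming the per-scale pedestrian counts $A_{uv}$ and the per-map totals $Z_u$ are already known; the rounding policy is an illustrative assumption.

```python
def allocate_proposals(A: list[list[int]], Z: list[int]) -> list[list[int]]:
    """T[u][v] = Z[u] * A[u][v] / sum_v A[u][v]   -- formula (2.3).
    A[u][v]: number of pedestrians of scale v on feature map u;
    Z[u]: total candidate windows to extract on feature map u."""
    T = []
    for counts, z_u in zip(A, Z):
        total = sum(counts)
        row = ([round(z_u * a_uv / total) for a_uv in counts]
               if total else [0] * len(counts))
        T.append(row)
    return T

# Example with three feature maps (reduced, original, enlarged), three scales:
# allocate_proposals([[10, 60, 30], [20, 50, 30], [70, 20, 10]], [300, 300, 300])
# -> [[30, 180, 90], [60, 150, 90], [210, 60, 30]]
```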
Further, the step (5) is specifically as follows:
1) A KITTI dataset with various pedestrian scales is selected for the experiment, and pedestrians in the training set are divided into X scales by height (the number of scales is not limited);
2) A joint network of a region proposal network (RPN) and a softmax classifier is trained with shared convolution-layer features, using alternating training: the RPN region proposal network is trained first, the candidate windows are then used to train the region-based classifier network, and the classifier network is then used to retrain the RPN. The loss layer is the end point of the convolutional neural network (CNN) and accepts two values as input: the prediction of the CNN and the true label. From the predictions and labels the loss layer computes the loss function of the current network, generally written L(W), where W is the vector space formed by the current network weights. The purpose of training is to find the weights W(opt) that minimize the loss function L(W) over the weight space; W(opt) can be approximated by stochastic gradient descent (SGD). The network has two loss functions, a classification loss and a regression loss;
3) Because the structure in step (3) is changed, the loss function is adapted accordingly. The parameter to be trained and optimized is W, and the training set is $\{(M_i, y_i, B_i)\}_{i=1}^{N}$, where $M_i$ is an image block of interest sampled for training, $N$ is the total number of training samples, $y_i \in \{0, 1\}$ is the class label of $M_i$, and $B_i = (m'/m) \cdot (b_i^x, b_i^y, b_i^w, b_i^h)$ are the bounding-box coordinates corresponding to the feature map, where $b_i^x, b_i^y, b_i^w, b_i^h$ are the coordinates of the image block on the original image and $(m'/m)$ is the zoom factor explained in step (3);
4) The multitask loss function is therefore:

$$L(W) = \sum_{x=1}^{n} \frac{A_x}{\sum_{j=1}^{n} A_j} \cdot \frac{1}{|E_x|} \sum_{M_i \in E_x} l\big(M_i, (y_i, B_i) \mid W\big) \qquad (2.4)$$

where $n$ is the number of scale steps of the target size, $E_x$ is the set of data samples at each scale, $M_i$ is an image block of interest sampled from the training set, $A_1, A_2, \ldots, A_n$ are the numbers of pedestrians at the $n$ scales, and $l$ is the combined classification-and-regression loss, defined as:
$$l(M, (y, B) \mid W) = L_{cls}(p(M), y) + \beta\,[y \ge 1]\, L_{loc}(T^y, B) \qquad (2.5)$$
where $\beta$ is a trade-off coefficient, $T^y$ is the predicted box position for class $y$, $[y \ge 1]$ indicates that the regression loss applies only to positive samples, and $L_{cls}$ and $L_{loc}$ are the cross-entropy loss and the bounding-box regression loss respectively, defined as:

$$L_{cls}(p(M), y) = -\log p_y(M), \qquad L_{loc}(T^y, B) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\big(t_i^y - b_i\big)$$
where $p(M) = (p_0(M), p_1(M))$ is the predicted class distribution over background and pedestrian, $y \in \{0, 1\}$ is the class label of $M$, $T^y = (t^x, t^y, t^w, t^h)$ is the predicted bounding-box position, and $B = (m'/m) \cdot (b^x, b^y, b^w, b^h)$ are the bounding-box coordinates corresponding to the feature map.
5) Because the prediction probability p and the predicted box T in 4) are obtained by multiplying the proposal's feature vector with the respective weight vectors, the joint parameters of the classification and regression branches can be continuously adjusted from the predicted values and the labels so as to minimize the loss function L(W), yielding the jointly optimal parameters $W(w_{cls}, w_{loc})$, i.e. $W(w_{cls}, w_{loc}) = \arg\min_W L(W) + \phi\|W\|$, where L(W) is the multitask loss function and $\phi$ is the regularization parameter.
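A minimal PyTorch sketch of the combined loss (2.5) and the scale-weighted multitask loss (2.4) follows, under the assumption that $L_{cls}$ and $L_{loc}$ are the usual cross-entropy and smooth-L1 terms; the batch layout, tensor shapes, and the default value of β are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def combined_loss(p, y, t, b, beta: float = 1.0) -> torch.Tensor:
    """l(M,(y,B)|W) = L_cls(p(M), y) + beta*[y >= 1]*L_loc(T^y, B) -- (2.5).
    p: (num_proposals, 2) class scores; y: (num_proposals,) labels in {0, 1};
    t, b: (num_proposals, 4) predicted and ground-truth box offsets."""
    l_cls = F.cross_entropy(p, y, reduction="none")           # -log p_y(M)
    l_loc = F.smooth_l1_loss(t, b, reduction="none").sum(-1)  # box regression
    return l_cls + beta * (y >= 1).float() * l_loc

def multitask_loss(per_scale_batches, A, beta: float = 1.0) -> torch.Tensor:
    """L(W) = sum_x (A_x / sum_j A_j) * mean over E_x of l(...)  -- (2.4).
    per_scale_batches: list of (p, y, t, b) tuples, one per pedestrian scale;
    A: list of pedestrian counts A_1..A_n per scale."""
    total_A = float(sum(A))
    loss = torch.zeros(())
    for (p, y, t, b), a_x in zip(per_scale_batches, A):
        loss = loss + (a_x / total_A) * combined_loss(p, y, t, b, beta).mean()
    return loss
```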
Further, the step (6) is specifically as follows: the proposal windows from step (4) are pooled into J windows; fixed-size feature maps are produced by the region-of-interest pooling layer and passed through the fully connected layers; the windows are classified by the classifier trained in step (5); and non-maximum suppression removes the windows that overlap the maximum-confidence window by more than 65%.
The invention discloses a pedestrian detection method based on a convolutional neural network in which the feature maps corresponding to scales adjacent to the original image are calculated through the image feature pyramid rule. This avoids the heavy computation of obtaining feature maps by rescaling the image, and detecting with different weights on the different feature maps effectively avoids the false positives and missed detections of single-feature-map detection. An effective trade-off between pedestrian detection speed and accuracy is achieved.
Drawings
FIG. 1 is a flow chart of a convolutional neural network-based pedestrian detection method of the present invention;
FIG. 2 is a diagram of the candidate window (proposal) optimization and selection algorithm of the pedestrian detection method of the present invention;
FIG. 3 is a schematic diagram of a non-maxima suppression implementation;
FIG. 4 is an effect diagram of the pedestrian detection method of the present application on a KITTI dataset picture.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
As shown in FIG. 1, the pedestrian detection method based on the convolutional neural network of the present invention comprises the following steps:
Step (1) determining a current frame image: taking a picture in the test set, or a frame to be processed in a video sequence, as the current frame image;
Step (2) obtaining a feature map: passing the current frame image through a number of convolution layers and pooling layers, and obtaining a feature map from the last convolution layer;
Step (3) feature map expansion: calculating the feature maps corresponding to adjacent scales of the image through the image power-law rule and the image feature pyramid rule, the scales and number of expanded pictures being unlimited;
Step (4) proposal window allocation: selecting a suitable pedestrian dataset, or producing one with variable scales, dividing the targets in the picture into small, medium, and large scales (the number of scales is determined by the pedestrian scales of the dataset), and allocating the number of proposal windows according to the proportion of targets of each scale in the picture;
Step (5) classification network training: training a deep neural network using the distribution of pedestrians of various scales across the different feature maps;
Step (6) pedestrian detection and labeling: pooling, in proportion, the candidate windows from the obtained multi-scale feature maps, classifying them with the classifier trained in step (5), and framing the pedestrians after non-maximum suppression.
The step (1) of determining the current frame image is as follows: a picture in the test set, or a frame to be processed in a video sequence, is taken as the current frame image and denoted $I_1$.
The step (2) of obtaining the feature map is as follows: the current frame image is passed through multiple convolution layers and pooling layers, which may be interleaved in any order; the last convolution layer yields a feature map, denoted $f_1$.
The step (3) of expanding the feature maps is as follows: the feature maps corresponding to scales adjacent to image $I_1$ are calculated through the image power-law rule and the image feature pyramid rule. Ordinarily one would use $f_M = C_p(S(I_1, M))$, where $I_1$ is the original image, $M$ is the zoom factor, $S$ denotes scaling the image, and $C_p$ denotes computing features by the convolution-pooling operation. To reduce convolution operations and increase running speed, the following formula is used instead:

$$f_{m'} = S(f_m, m'/m) \cdot (m'/m)^{-\alpha} \qquad (3.1)$$

where the parameter $m$ denotes the current scale, $m'$ the target scale, $S$ rescaling the feature map by a factor of $m'/m$, and $f$ the feature map; the constant coefficient $\alpha$ can be measured experimentally on the training set. The formula states that only the original image $I_m$ has its features computed by the convolution-pooling operation, and the image features at nearby scales are approximated from the known feature map; for example, $f_{1/2}$ can be calculated for $1/2 \times I_1$. Because image upsampling incurs no high-frequency loss, the information content of an upsampled picture is similar to that of the low-resolution original, and the feature calculation formula is:

$$f_\sigma = \sigma \cdot S(f_1, \sigma) \qquad (3.2)$$

where $f_1$ is the feature map of the original image, $S$ denotes magnifying the feature map $f_1$ by a factor of $\sigma$, and $f_\sigma$ is the up-sampled feature map.
The step (4) of allocating the proposal windows is as follows: because the RPN has a single receptive field, large targets tend to be detected on the feature map of a reduced-scale image and small targets on the feature map of an enlarged-scale image, so the targets in the image are divided into three scales. An experiment is carried out on the KITTI dataset, which contains pedestrians at multiple scales. Pedestrians in the dataset are binned by height into the three scales height $< H_1$, $H_1 \le$ height $< H_2$, and height $\ge H_2$, where $H_1$ and $H_2$ are 50 and 200 pixels respectively, and the numbers of pedestrians at the three scales are $A_1$, $A_2$, $A_3$. Then, on each feature map, the top $Z_K \cdot A_1/(A_1 + A_2 + A_3)$ candidate windows with pedestrian height below 50 pixels are selected by confidence, the top $Z_K \cdot A_2/(A_1 + A_2 + A_3)$ candidate windows with pedestrian height between 50 and 200 pixels, and the top $Z_K \cdot A_3/(A_1 + A_2 + A_3)$ candidate windows with pedestrian height above 200 pixels, where $A_1$, $A_2$, $A_3$ are the numbers of pedestrian candidate windows extracted at the three scales, $K = 1, 2, 3$ indexes the reduced, original, and enlarged feature maps respectively, and $Z_K$ is the number of candidate windows to be extracted from each feature map. As shown in FIG. 2, $f_1$, $f_{1/2}$, and $f_2$ respectively denote the feature map of image $I_1$ obtained from the last convolution layer and the feature maps of $I_1$ obtained by the expansion calculation, and the numbers of candidate windows selected on them are $Z_1$, $Z_2$, $Z_3$ respectively. Because the different feature maps are biased toward detecting pedestrians of different sizes, allocating the candidate windows in proportion to the different target scales helps the network exploit its detection advantages on the different feature maps; a sketch of this selection follows.
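The confidence-based selection just described can be sketched as follows, using the 50 and 200 pixel height thresholds of the KITTI experiment; the candidate-window representation and the rounding are illustrative assumptions.

```python
H1, H2 = 50, 200  # pixel-height thresholds used in the KITTI experiment

def select_candidates(candidates, Z_K: int, A: tuple) -> list:
    """candidates: list of (height_px, confidence, box) from the RPN on one
    feature map; Z_K: total windows to keep on this feature map;
    A = (A_1, A_2, A_3): pedestrian counts at the three scales. Keeps the
    top candidates of each height bucket by confidence, in proportion
    Z_K * A_v / (A_1 + A_2 + A_3)."""
    buckets = {0: [], 1: [], 2: []}
    for h, conf, box in candidates:
        v = 0 if h < H1 else (1 if h < H2 else 2)
        buckets[v].append((conf, box))
    kept, total = [], sum(A)
    for v, bucket in buckets.items():
        quota = round(Z_K * A[v] / total)
        bucket.sort(key=lambda cb: cb[0], reverse=True)  # highest confidence first
        kept.extend(box for _, box in bucket[:quota])
    return kept
```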
The step of training the classification network in the step (5) comprises the following steps:
1) A KITTI dataset with various pedestrian scales is selected for the experiment, and pedestrians in the training set are divided into X scales by height (the number of scales is not limited);
2) A joint network of a region proposal network (RPN) and a softmax classifier is trained with shared convolution-layer features, using alternating training: the region proposal network is trained first, the proposals are then used to train the region-based classifier network, and the classifier network is then used to retrain the region proposal network. The loss layer is the end point of the convolutional neural network (CNN) and accepts two values as input: the prediction of the CNN and the true label. From these two inputs the loss layer computes the loss function of the current network, generally written L(W), where W is the vector space formed by the current network weights. The purpose of training is to find the weights W(opt) that minimize the loss function L(W) over the weight space; W(opt) can be approximated by stochastic gradient descent (SGD). The network has two loss functions, a classification loss and a regression loss;
3) Because the structure in step (3) is changed, the loss function is adapted accordingly. The parameter to be trained and optimized is W, and the training set is $\{(M_i, y_i, B_i)\}_{i=1}^{N}$, where $M_i$ is an image block of interest sampled for training, $N$ is the total number of training samples, $y_i \in \{0, 1\}$ is the class label of $M_i$, and $B_i = (m'/m) \cdot (b_i^x, b_i^y, b_i^w, b_i^h)$ are the bounding-box coordinates corresponding to the feature map, where $b_i^x, b_i^y, b_i^w, b_i^h$ are the coordinates of the image block on the original image and $(m'/m)$ is the zoom factor explained in step (3);
4) The multitask loss function is therefore:

$$L(W) = \sum_{x=1}^{n} \frac{A_x}{\sum_{j=1}^{n} A_j} \cdot \frac{1}{|E_x|} \sum_{M_i \in E_x} l\big(M_i, (y_i, B_i) \mid W\big) \qquad (3.3)$$

where $n$ is the number of scale steps of the target size, $E_x$ is the set of data samples at each scale, $M_i$ is an image block of interest sampled from the training set, $A_1, A_2, \ldots, A_n$ are the numbers of pedestrians at the $n$ scales, and $l$ is the combined classification-and-regression loss, defined as:
$$l(M, (y, B) \mid W) = L_{cls}(p(M), y) + \beta\,[y \ge 1]\, L_{loc}(T^y, B) \qquad (3.4)$$
where $\beta$ is a trade-off coefficient, $T^y$ is the predicted box position for class $y$, $[y \ge 1]$ indicates that the regression loss applies only to positive samples, and $L_{cls}$ and $L_{loc}$ are the cross-entropy loss and the bounding-box regression loss respectively, defined as:

$$L_{cls}(p(M), y) = -\log p_y(M), \qquad L_{loc}(T^y, B) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\big(t_i^y - b_i\big)$$
where $p(M) = (p_0(M), p_1(M))$ is the predicted class distribution over background and pedestrian, $y \in \{0, 1\}$ is the class label of $M$, $T^y = (t^x, t^y, t^w, t^h)$ is the predicted bounding-box position, and $B = (m'/m) \cdot (b^x, b^y, b^w, b^h)$ are the bounding-box coordinates corresponding to the feature map.
5) Because the prediction probability p and the predicted box T in 4) are obtained by multiplying the proposal's feature vector with the respective weight vectors, the joint parameters of the classification and regression branches can be continuously adjusted from the predicted values and the labels so as to minimize the loss function L(W), yielding the jointly optimal parameters $W(w_{cls}, w_{loc})$, i.e. $W(w_{cls}, w_{loc}) = \arg\min_W L(W) + \phi\|W\|$, where L(W) is the multitask loss function and $\phi$ is the regularization parameter.
The step (6) of detecting and labeling pedestrians is as follows: the J proposal windows from step (4) are pooled; fixed-size feature maps are produced through the region-of-interest pooling layer and the fully connected layers; the trained classifier yields the confidence of each pedestrian candidate; and the confidence of each scale's candidates is multiplied by the corresponding weight $l_x$ obtained from the training in step (5). Non-maximum suppression then removes the windows that overlap the maximum-confidence window by more than 65%. As shown in FIG. 3, $S_1$ and $S_3$ denote the areas of two detection boxes and $S_2$ denotes their overlap area; the intersection-over-union is $S_2/(S_1 + S_3 - S_2)$, and if it is greater than the threshold 0.65, the box with the lower confidence is discarded. FIG. 4 shows the effect of the pedestrian detection method of the present application on a KITTI dataset picture; many pedestrians with a height of less than 50 pixels are detected, demonstrating the feasibility and detection advantages of the pedestrian detection method.
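A sketch of the non-maximum suppression of FIG. 3 follows, computing the intersection-over-union $S_2/(S_1 + S_3 - S_2)$ and applying the 0.65 overlap threshold together with the 0.75 confidence threshold of claim 5; the (x1, y1, x2, y2) box format and the greedy ordering are conventional assumptions.

```python
def iou(a, b) -> float:
    """Intersection over union of boxes (x1, y1, x2, y2):
    S_2 / (S_1 + S_3 - S_2) in the notation of FIG. 3."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # S_2
    area_a = (a[2] - a[0]) * (a[3] - a[1])              # S_1
    area_b = (b[2] - b[0]) * (b[3] - b[1])              # S_3
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(detections, iou_thresh: float = 0.65, conf_thresh: float = 0.75) -> list:
    """detections: list of (confidence, box). Keep boxes above the confidence
    threshold, then greedily drop any box overlapping a kept, higher-confidence
    box by more than the IoU threshold."""
    kept = []
    for conf, box in sorted(detections, reverse=True, key=lambda d: d[0]):
        if conf < conf_thresh:
            break  # remaining detections have even lower confidence
        if all(iou(box, k) <= iou_thresh for k in kept):
            kept.append(box)
    return kept
```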

Claims (5)

1. A pedestrian detection method, characterized by comprising the steps of:
Step (1) determining a current frame image: taking a picture in the test set, or a frame to be processed in a video sequence, as the current frame image, denoted $I_1$;
Step (2) calculating a feature map: passing the current frame image through multiple convolution layers and pooling layers, which may be interleaved in any order, and obtaining a feature map from the last convolution layer, denoted $f_1$;
Step (3) feature map expansion: calculating the feature maps corresponding to adjacent scales of the image through the image feature pyramid rule, expanding in turn N small-scale and N large-scale feature maps (the number of expansions N and the expansion factors are not limited), so that 2N + 1 feature maps are obtained;
Step (4) extracting candidate windows: generating candidate windows from each feature map through a region proposal network (RPN), and further selecting the candidate windows according to the pedestrian size distribution;
Step (5) training a classifier: training a deep neural network using the distribution of pedestrians of various scales across the different feature maps;
Step (6) pedestrian detection output: pooling the candidate windows obtained from the multi-scale feature maps, classifying them with the trained classifier, and framing the pedestrians after non-maximum suppression.
2. The pedestrian detection method according to claim 1, characterized in that the step (3) is specifically as follows: the feature maps of image $I_1$ corresponding to adjacent scales are computed. Ordinarily one would use $f_M = C_p(S(I_1, M))$, where $I_1$ is the original image, $M$ is the zoom factor, $S$ denotes scaling the image, and $C_p$ denotes computing features by the convolution-pooling operation; to reduce convolution operations and improve computation speed, the feature maps corresponding to adjacent-scale images are calculated according to the image feature pyramid rule by the formula:

$$f_{m'} = S(f_m, m'/m) \cdot (m'/m)^{-\alpha} \qquad (1.1)$$

where the parameter $m$ denotes the current scale, $m'$ the target scale, $S$ rescaling the feature map by a factor of $m'/m$, and $f$ the feature map; the constant coefficient $\alpha$ can be measured experimentally on the training set. The formula states that only the original image $I_m$ has its features computed by the convolution-pooling operation, and the image features at nearby scales are approximated from the known feature map; for example, $f_{1/2}$ can be calculated for $1/2 \times I_1$. Because image upsampling incurs no high-frequency loss, the information content of an upsampled picture is similar to that of the low-resolution original, and the feature calculation formula is:

$$f_\sigma = \sigma \cdot S(f_1, \sigma) \qquad (1.2)$$

where $f_1$ is the feature map of the original image, $S$ denotes magnifying the feature map $f_1$ by a factor of $\sigma$, and $f_\sigma$ is the up-sampled feature map.
3. The pedestrian detection method according to claim 1, characterized in that the step (4) is specifically as follows: candidate proposal windows are generated from each feature map through the RPN network, and the pedestrians in the candidate windows are binned by height into the scales height $< H_1$, $H_1 \le$ height $< H_2$, ..., $H_{n-1} \le$ height $< H_n$, height $\ge H_n$, where $H_1$ to $H_n$ are pixel heights ordered from small to large and the numbers of pedestrians at the different scales are $A_1, A_2, \ldots, A_n$; then, on each feature map, the pedestrian candidate frames of each scale are selected according to the proportional distribution of the candidate frames in that feature map, the number $T_{uv}$ of candidate windows selected in turn being:

$$T_{uv} = Z_u \cdot \frac{A_{uv}}{\sum_{v'=1}^{n} A_{uv'}} \qquad (1.3)$$

where $T_{uv}$ is the number of candidate windows of the $v$-th pedestrian scale to be finally extracted on the $u$-th feature map, $Z_u$ ($1 \le u \le 2N+1$) is the total number of candidate windows to be extracted on the $u$-th feature map (chosen according to the dataset; the same or different totals may be used for each feature map), and $A_{uv}$ is the number of pedestrians of the $v$-th scale on the $u$-th feature map.
4. The pedestrian detection method according to claim 1, characterized in that: the step (5) is specifically as follows:
1) A KITTI dataset with various pedestrian scales is selected for the experiment, and pedestrians in the training set are divided into n scales by height;
2) The deep neural network is trained with the training set of the KITTI dataset. The loss layer of the convolutional neural network (CNN) accepts two values as input: the prediction of the CNN and the true label. From the predicted value and the label value the loss layer computes the loss function of the current network, generally written L(W), where W is the vector space formed by the current network weights;
3) The loss function is adapted accordingly; the parameter to be trained and optimized is W, and the training set is $\{(M_i, y_i, B_i)\}_{i=1}^{N}$, where $M_i$ is an image block of interest sampled from the training set, $N$ is the total number of training samples, $y_i \in \{0, 1\}$ is the class label of $M_i$, and $B_i = (m'/m) \cdot (b_i^x, b_i^y, b_i^w, b_i^h)$ are the bounding-box coordinates corresponding to the feature map, where $b_i^x, b_i^y, b_i^w, b_i^h$ are the coordinates of the image block on the original image, $(m'/m)$ is the zoom factor, $m$ denotes the current scale, and $m'$ the scaled scale;
4) The multitask loss function is:

$$L(W) = \sum_{x=1}^{n} \frac{A_x}{\sum_{j=1}^{n} A_j} \cdot \frac{1}{|E_x|} \sum_{M_i \in E_x} l\big(M_i, (y_i, B_i) \mid W\big) \qquad (1.4)$$

where $n$ is the number of scale steps of the target size, $E_x$ is the set of data samples at each scale, $M_i$ is an image block of interest sampled from the training set, $A_1, A_2, \ldots, A_n$ are the numbers of pedestrians at the $n$ scales, and $l$ is the combined classification-and-regression loss, defined as:
$$l(M, (y, B) \mid W) = L_{cls}(p(M), y) + \beta\,[y \ge 1]\, L_{loc}(T^y, B) \qquad (1.5)$$
where $\beta$ is a trade-off coefficient, $T^y$ is the predicted box position for class $y$, $[y \ge 1]$ indicates that the regression loss applies only to positive samples, and $L_{cls}$ and $L_{loc}$ are the cross-entropy loss and the bounding-box regression loss respectively, defined as:

$$L_{cls}(p(M), y) = -\log p_y(M), \qquad L_{loc}(T^y, B) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\big(t_i^y - b_i\big)$$
where $p(M) = (p_0(M), p_1(M))$ is the predicted class distribution, $y \in \{0, 1\}$ is the class label of $M$, $T^y = (t^x, t^y, t^w, t^h)$ is the predicted bounding-box position, and $B = (m'/m) \cdot (b^x, b^y, b^w, b^h)$ are the bounding-box coordinates corresponding to the feature map;
5) Because the prediction probability p and the predicted box T in 4) are obtained by multiplying the feature vector with the respective weight vectors, the joint parameters of the classification and regression processes can be continuously adjusted from the predicted values and the labels so as to minimize the loss function L(W), yielding the jointly optimal parameters $W(w_{cls}, w_{loc})$, i.e. $W(w_{cls}, w_{loc}) = \arg\min_W L(W) + \phi\|W\|$, where L(W) is the multitask loss function and $\phi$ is the regularization parameter.
5. The pedestrian detection method according to claim 1, characterized in that the step (6) is specifically as follows: the J proposal windows from step (4) are pooled; the feature maps are fixed in size and passed through the fully connected layer; the trained classifier yields the confidence of each pedestrian candidate, and a result greater than 0.75 is judged to be a pedestrian; the framed pedestrians then pass through non-maximum suppression, which removes the windows that overlap the maximum-confidence window by more than 65%.
CN201711030102.8A 2017-10-27 2017-10-27 Pedestrian detection method Active CN108038409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711030102.8A CN108038409B (en) 2017-10-27 2017-10-27 Pedestrian detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711030102.8A CN108038409B (en) 2017-10-27 2017-10-27 Pedestrian detection method

Publications (2)

Publication Number Publication Date
CN108038409A (en) 2018-05-15
CN108038409B (en) 2021-12-28

Family

ID=62093419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711030102.8A Active CN108038409B (en) 2017-10-27 2017-10-27 Pedestrian detection method

Country Status (1)

Country Link
CN (1) CN108038409B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750708A (en) * 2012-05-11 2012-10-24 天津大学 Affine motion target tracing algorithm based on fast robust feature matching
CN103247059A (en) * 2013-05-27 2013-08-14 北京师范大学 Remote sensing image region of interest detection method based on integer wavelets and visual features
US20160104056A1 (en) * 2014-10-09 2016-04-14 Microsoft Technology Licensing, Llc Spatial pyramid pooling networks for image processing
CN104850844A (en) * 2015-05-27 2015-08-19 成都新舟锐视科技有限公司 Pedestrian detection method based on rapid construction of image characteristic pyramid
CN105678231A (en) * 2015-12-30 2016-06-15 中通服公众信息产业股份有限公司 Pedestrian image detection method based on sparse coding and neural network

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110647897A (en) * 2018-06-26 2020-01-03 广东工业大学 Zero sample image classification and identification method based on multi-part attention mechanism
CN110647897B (en) * 2018-06-26 2023-04-18 广东工业大学 Zero sample image classification and identification method based on multi-part attention mechanism
CN110659658B (en) * 2018-06-29 2022-07-29 杭州海康威视数字技术股份有限公司 Target detection method and device
CN109117717A (en) * 2018-06-29 2019-01-01 广州烽火众智数字技术有限公司 A kind of city pedestrian detection method
CN110659658A (en) * 2018-06-29 2020-01-07 杭州海康威视数字技术股份有限公司 Target detection method and device
CN109003223A (en) * 2018-07-13 2018-12-14 北京字节跳动网络技术有限公司 Image processing method and device
CN109003223B (en) * 2018-07-13 2020-02-28 北京字节跳动网络技术有限公司 Picture processing method and device
CN109284670A (en) * 2018-08-01 2019-01-29 清华大学 A kind of pedestrian detection method and device based on multiple dimensioned attention mechanism
CN109284669A (en) * 2018-08-01 2019-01-29 辽宁工业大学 Pedestrian detection method based on Mask RCNN
CN109101915A (en) * 2018-08-01 2018-12-28 中国计量大学 Face and pedestrian and Attribute Recognition network structure design method based on deep learning
CN109255352B (en) * 2018-09-07 2021-06-22 北京旷视科技有限公司 Target detection method, device and system
CN109255352A (en) * 2018-09-07 2019-01-22 北京旷视科技有限公司 Object detection method, apparatus and system
CN109242801A (en) * 2018-09-26 2019-01-18 北京字节跳动网络技术有限公司 Image processing method and device
CN109492596B (en) * 2018-11-19 2022-03-29 南京信息工程大学 Pedestrian detection method and system based on K-means clustering and regional recommendation network
CN109492596A (en) * 2018-11-19 2019-03-19 南京信息工程大学 A kind of pedestrian detection method and system based on K-means cluster and region recommendation network
CN109658412A (en) * 2018-11-30 2019-04-19 湖南视比特机器人有限公司 It is a kind of towards de-stacking sorting packing case quickly identify dividing method
CN109800637A (en) * 2018-12-14 2019-05-24 中国科学院深圳先进技术研究院 A kind of remote sensing image small target detecting method
CN109829421A (en) * 2019-01-29 2019-05-31 西安邮电大学 The method, apparatus and computer readable storage medium of vehicle detection
CN109858451A (en) * 2019-02-14 2019-06-07 清华大学深圳研究生院 A kind of non-cooperation hand detection method
CN109858451B (en) * 2019-02-14 2020-10-23 清华大学深圳研究生院 Non-matching hand detection method
CN110059544A (en) * 2019-03-07 2019-07-26 华中科技大学 A kind of pedestrian detection method and system based on road scene
CN110097050B (en) * 2019-04-03 2024-03-08 平安科技(深圳)有限公司 Pedestrian detection method, device, computer equipment and storage medium
CN110097050A (en) * 2019-04-03 2019-08-06 平安科技(深圳)有限公司 Pedestrian detection method, device, computer equipment and storage medium
CN110136097A (en) * 2019-04-10 2019-08-16 南方电网科学研究院有限责任公司 One kind being based on the pyramidal insulator breakdown recognition methods of feature and device
CN110211097A (en) * 2019-05-14 2019-09-06 河海大学 A kind of crack image detecting method based on the migration of Faster R-CNN parameter
CN110263712A (en) * 2019-06-20 2019-09-20 江南大学 A kind of coarse-fine pedestrian detection method based on region candidate
CN110490058A (en) * 2019-07-09 2019-11-22 北京迈格威科技有限公司 Training method, device, system and the computer-readable medium of pedestrian detection model
WO2021018106A1 (en) * 2019-07-30 2021-02-04 华为技术有限公司 Pedestrian detection method, apparatus, computer-readable storage medium and chip
CN110443366A (en) * 2019-07-30 2019-11-12 上海商汤智能科技有限公司 Optimization method and device, object detection method and the device of neural network
CN110648322A (en) * 2019-09-25 2020-01-03 杭州智团信息技术有限公司 Method and system for detecting abnormal cervical cells
CN110648322B (en) * 2019-09-25 2023-08-15 杭州智团信息技术有限公司 Cervical abnormal cell detection method and system
CN111339967A (en) * 2020-02-28 2020-06-26 长安大学 Pedestrian detection method based on multi-view graph convolution network
CN111339967B (en) * 2020-02-28 2023-04-07 长安大学 Pedestrian detection method based on multi-view graph convolution network
CN111523494A (en) * 2020-04-27 2020-08-11 天津中科智能识别产业技术研究院有限公司 Human body image detection method
CN111832383A (en) * 2020-05-08 2020-10-27 北京嘀嘀无限科技发展有限公司 Training method of gesture key point recognition model, gesture recognition method and device
CN111832383B (en) * 2020-05-08 2023-12-08 北京嘀嘀无限科技发展有限公司 Training method of gesture key point recognition model, gesture recognition method and device
CN111681243A (en) * 2020-08-17 2020-09-18 广东利元亨智能装备股份有限公司 Welding image processing method and device and electronic equipment

Also Published As

Publication number Publication date
CN108038409B (en) 2021-12-28


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant