CN109002752A - A fast pedestrian detection method for complex public scenes based on deep learning - Google Patents
A fast pedestrian detection method for complex public scenes based on deep learning
- Publication number
- CN109002752A (application CN201810021283.6A)
- Authority
- CN
- China
- Prior art keywords
- training
- pedestrian
- neural networks
- convolutional neural
- box
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
Abstract
The present invention relates to a fast pedestrian detection method for complex public scenes based on deep learning, comprising: preprocessing the pixel size of training and test images; pre-training a convolutional neural network on a classification task; training the convolutional neural network on the pedestrian detection task; eliminating low-confidence prediction boxes by threshold filtering; and eliminating multiple predictions of the same pedestrian by non-maximum suppression. During pre-training, cross entropy is used as the loss function. In the final training stage, an improved mean square error is used as the loss function so that the network outputs regression results for predicted pedestrian bounding boxes. In the test phase, an image is taken as the input of the convolutional neural network, and all prediction results output by the network are filtered by threshold filtering and non-maximum suppression to obtain the location information of the detected pedestrians, thereby achieving intelligent pedestrian monitoring.
Description
Technical field
The present invention relates to image processing techniques, and more particularly to a fast pedestrian detection method for public scenes based on convolutional neural networks.
Background technique
In recent years, surveillance cameras have been deployed in every public place; public scenes such as airports, stations, hospitals, and roads are covered by thousands of surveillance cameras. Detecting pedestrians in public scenes is significant for analyzing pedestrian flow, discovering abnormal crowd behavior, and tracking specific individuals. Because the volume of video data is huge and pedestrians are numerous, manual analysis often fails to analyze target pedestrians quickly and accurately. Existing automatic pedestrian detection methods are often slow and cannot achieve real-time monitoring of pedestrian targets. To realize automatic real-time pedestrian detection in public scenes, it is significant to study a fast pedestrian detection method for public scenes.
Summary of the invention
In view of this, the main purpose of the present invention is to provide a fast pedestrian detection method for public scenes based on convolutional neural networks, with high detection accuracy and high detection speed, which substantially improves detection accuracy while enabling real-time detection of pedestrians.
To achieve the above object, the technical solution proposed by the present invention, a fast pedestrian detection method for public scenes based on convolutional neural networks, is realized in the following steps:
Step 1: Read the training database pictures used for training, and stretch or compress their pixel size to a fixed size A × B using the bilinear interpolation algorithm.
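The bilinear stretching/compression of Step 1 can be sketched in pure NumPy. This is an illustrative sketch, not the patent's implementation; in practice a library routine such as OpenCV's resize would typically be used. This version aligns the image corners:

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resize an H x W x C image to out_h x out_w by bilinear interpolation."""
    h, w = img.shape[:2]
    # Sample positions in the source image for each output pixel (corner-aligned).
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

frame = np.random.rand(300, 500, 3)         # an arbitrary video frame
resized = bilinear_resize(frame, 448, 448)  # fixed size A x B = 448 x 448
print(resized.shape)                        # (448, 448, 3)
```

The same routine serves both stretching (upsampling) and compression (downsampling), since the sample grid is simply denser or sparser than the source grid.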
Step 2: Using the training database, pre-train the convolutional neural network on a classification task. Take the training database pictures with adjusted pixel size as input, compare the classification results output by the network with the labels of the input pictures, and compute the loss function. Minimize the loss function to pre-train the convolutional neural network.
Step 3: Read the pictures of the special-scene database, and stretch or compress their pixel size to the fixed size using the bilinear interpolation algorithm.
Step 4: Inherit the weights obtained by the pre-training network, change the structure of the end of the convolutional neural network, and, using the special-scene database, fine-tune the neural network for the pedestrian detection task. During training, take the images of the special scene as input, compare the output of the convolutional neural network with the labels of the corresponding pictures, and compute the loss function. Minimize the loss function to train the convolutional neural network.
Step 5: Read the pedestrian video of the public scene, decompose the video into single frames, and stretch or compress their pixel size to the fixed size using the bilinear interpolation algorithm.
Step 6: Use the trained network to perform target detection on the pedestrians in the pictures. Input the picture to be tested with adjusted pixel size into the trained network; the convolutional neural network extracts the features of the target objects in the image, and finally two fully connected layers output a tensor of dimension 7 × 7 × (2 × 5). The tensor represents the 98 prediction boxes that the convolutional neural network makes for the pedestrians to be detected.
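The 7 × 7 × (2 × 5) output tensor can be decoded into the 98 boxes as follows. This is a sketch: the cell-relative meaning of x, y and the memory order of the two boxes per cell are assumptions based on the description, not details given by the patent:

```python
import numpy as np

S, B = 7, 2                            # grid size and boxes per cell, per the description
output = np.random.rand(S, S, B * 5)   # stand-in for the network's output tensor

boxes = []
for i in range(S):                     # grid row
    for j in range(S):                 # grid column
        for b in range(B):             # box index within the cell
            x, y, w, h, conf = output[i, j, b * 5:(b + 1) * 5]
            # convert the cell-relative center to image-relative coordinates in [0, 1]
            cx, cy = (j + x) / S, (i + y) / S
            boxes.append((cx, cy, w, h, conf))
print(len(boxes))  # 98
```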
Step 7: Set a threshold C_threshold for the confidence C, and filter the 98 prediction boxes generated by the convolutional neural network. Discard the prediction boxes whose confidence C is less than the given threshold C_threshold.
Step 8: Filter the prediction boxes with high overlap using non-maximum suppression. When the ratio of the intersection area to the union area of different prediction boxes exceeds the prescribed threshold IOU_threshold, retain only the prediction box with the maximum confidence C and suppress the other boxes. The retained prediction box data x, y, w, h, C are the spatial position coordinates and prediction confidences of the detected target pedestrians.
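Steps 7 and 8 together amount to confidence thresholding followed by greedy non-maximum suppression. A minimal sketch, with boxes in corner format (x1, y1, x2, y2) and hypothetical threshold values:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def detect(boxes, scores, c_threshold=0.25, iou_threshold=0.5):
    """Step 7: discard low-confidence boxes; Step 8: greedy NMS on the rest."""
    order = np.argsort(scores)[::-1]                         # highest confidence first
    order = [i for i in order if scores[i] >= c_threshold]   # Step 7
    keep = []
    for i in order:                                          # Step 8
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return [int(i) for i in keep]

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], float)
scores = np.array([0.9, 0.8, 0.1])
print(detect(boxes, scores, c_threshold=0.05))  # [0, 2]: box 1 overlaps box 0 heavily
```

With the default c_threshold of 0.25, box 2 is also discarded at Step 7 and only box 0 survives.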
In the above Step 2, the process of pre-training the convolutional neural network is as follows:
Step i) Use the first 20 convolutional layers and corresponding pooling layers of the network shown in Fig. 1, then append one average pooling layer and a fully connected layer.
Step ii) Take as input of the convolutional neural network the tensor data of dimension A × B × 3 of the RGB space corresponding to the fixed-size A × B pixels of the training database pictures, and output the probability y_i for each classification result.
Step iii) Compute the cross entropy between the network output probability y_i and the label probability y_i′, loss = −Σ_i y_i′ log(y_i), as the loss function; minimize the loss function loss to pre-train the network.
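The cross-entropy pre-training loss described here, in a minimal NumPy form (the small eps term is added for numerical safety and is not part of the description):

```python
import numpy as np

def cross_entropy(y_pred, y_true, eps=1e-12):
    """loss = -sum_i y'_i * log(y_i): y_pred are the network's predicted class
    probabilities, y_true is the (one-hot) label distribution."""
    return -np.sum(y_true * np.log(y_pred + eps))

y_pred = np.array([0.7, 0.2, 0.1])   # softmax output of the network
y_true = np.array([1.0, 0.0, 0.0])   # one-hot label
print(cross_entropy(y_pred, y_true))  # -log(0.7), about 0.357
```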
In the above Step 4, the method of finally training the convolutional neural network for pedestrian detection is as follows:
Step i) Retain the structure of the first 20 convolutional layers and corresponding pooling layers of the pre-training network and inherit their corresponding weights; append 4 convolutional layers and 2 fully connected layers with randomly initialized weights, keeping the network structure as shown in Fig. 1.
Step ii) The last fully connected layer of the network uses the linear activation function f(x) = x, while the other fully connected layers and convolutional layers use the leaky rectified linear unit activation function (Leaky ReLU): f(x) = max(x, 0.1x).
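The Leaky ReLU activation of step ii), with the 0.1 slope given in the description, is straightforward to express:

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """f(x) = max(x, 0.1 * x): identity for positive inputs, a small
    slope (0.1, per the description) for negative inputs."""
    return np.maximum(x, slope * x)

x = np.array([-2.0, -0.5, 0.0, 3.0])
print(leaky_relu(x))  # [-0.2, -0.05, 0.0, 3.0]
```

The last layer's linear activation f(x) = x needs no code; it simply passes its input through unchanged so the regression outputs are unbounded.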
Step iii) The training samples are the size-converted special-scene database pictures and their corresponding labels. Take as input of the convolutional neural network the tensor data of dimension A × B × 3 of the RGB space corresponding to the A × B pixels of the pictures. The output of the neural network is a tensor of dimension 7 × 7 × (2 × 5), representing the 98 prediction boxes made for the measured targets. Each box has the 5 data x, y, w, h, C.
Step iv) Read the labels of the pictures in the special-scene database and search for the ground-truth box data corresponding to the pedestrian targets; for the prediction box data x, y, w, h, C output by the network, compute the corresponding label data x′, y′, w′, h′, C′ = P × IOU. In the computation, the whole picture is imagined to be uniformly divided into a 7 × 7 grid; if the center coordinate of the ground-truth box of a pedestrian target falls in some grid cell, a group of labels x′, y′, w′, h′, C′ = P × IOU is generated. x′, y′ are the coordinates of the center point of the ground-truth box, with values between 0 and 1: if the center point of the ground-truth box of the pedestrian is at the lower-left corner of the corresponding grid cell, the value is (0, 0); if it is at the upper-right corner of the cell, the value is (1, 1). w′, h′ are the length and width of the ground-truth box, with values between 0 and 1: a length or width corresponding to a pixel size of 0 has the value 0, and a length or width corresponding to a pixel size of 448 has the value 1. C′ = P × IOU, where P = 1 and IOU is the ratio of the intersection area to the union area of the ranges represented by the prediction box x, y, w, h and the ground-truth box x′, y′, w′, h′.
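The label construction of step iv) for a single ground-truth box can be sketched as follows. The 448 × 448 image size and 7 × 7 grid follow the description; the array layout is an assumption, and C′ is filled in with IOU = 1 here, whereas during training the IOU term is computed against the current prediction:

```python
import numpy as np

S, IMG = 7, 448  # grid size and image side, per the description

def encode_label(box):
    """box = (cx, cy, w, h) in pixels; returns the (S, S, 5) label tensor with
    one entry (x', y', w', h', C') in the cell containing the box center."""
    cx, cy, w, h = box
    label = np.zeros((S, S, 5))
    col = min(int(cx / IMG * S), S - 1)       # responsible grid column
    row = min(int(cy / IMG * S), S - 1)       # responsible grid row
    x_rel = cx / IMG * S - col                # 0 at one edge of the cell, 1 at the other
    y_rel = cy / IMG * S - row
    # C' = P * IOU with P = 1; IOU taken as 1 in this static sketch
    label[row, col] = [x_rel, y_rel, w / IMG, h / IMG, 1.0]
    return label

lab = encode_label((224.0, 224.0, 112.0, 224.0))
print(lab[3, 3])  # [0.5, 0.5, 0.25, 0.5, 1.0]: center falls in cell (3, 3)
```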
Step v) Compute the improved mean square error loss function between the predicted values x, y, w, h, C and the label values x′, y′, w′, h′, C′ = P × IOU, where λ_coord = 5 and λ_noobj = 0.5, i indicates the i-th cell of the 7 × 7 grid, and j indicates the j-th of the 2 prediction boxes of each cell. If the center coordinate of a pedestrian target falls in the i-th grid cell and the j-th prediction box of that cell has the maximum IOU with the ground-truth pedestrian box, then 1_ij^obj = 1 and 1_ij^noobj = 0; otherwise 1_ij^obj = 0 and 1_ij^noobj = 1. Minimize the loss function loss to train the network.
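The original equation is given only as a figure and is not reproduced in the text. The "improved mean square error" described above matches the standard YOLO-style sum-of-squared-errors loss; a reconstruction under that assumption, using the notation of the description (no classification term appears, since the output carries only the single pedestrian class):

```latex
\begin{aligned}
loss ={}& \lambda_{coord} \sum_{i=1}^{49} \sum_{j=1}^{2} \mathbb{1}_{ij}^{obj}
        \left[ (x_{ij} - x'_{i})^{2} + (y_{ij} - y'_{i})^{2} \right] \\
      &+ \lambda_{coord} \sum_{i=1}^{49} \sum_{j=1}^{2} \mathbb{1}_{ij}^{obj}
        \left[ \left(\sqrt{w_{ij}} - \sqrt{w'_{i}}\right)^{2}
             + \left(\sqrt{h_{ij}} - \sqrt{h'_{i}}\right)^{2} \right] \\
      &+ \sum_{i=1}^{49} \sum_{j=1}^{2} \mathbb{1}_{ij}^{obj} (C_{ij} - C'_{i})^{2}
       + \lambda_{noobj} \sum_{i=1}^{49} \sum_{j=1}^{2} \mathbb{1}_{ij}^{noobj} (C_{ij} - C'_{i})^{2}
\end{aligned}
```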
In conclusion a kind of common scene rapid pedestrian detection method based on convolutional neural networks of the present invention,
Include: that single frames decomposition is carried out to the video under common scene, in decompositing the video frame come, uses the method for bilinear interpolation
Picture is converted into fixed pixel size.The training of pedestrian detection network is divided into pre-training and finally two processes of training,
In pre-training, use tranining database as training sample, be trained based on classification task, definition intersects entropy function as damage
Function is lost, whole network is trained by loss function, in final training process, inherits most of structure of pre-training network
And weight, network is improved, recurrence task is based on using the database of special scenes and is trained, is defined improved square
Error function trains whole network by minimizing loss function as loss function.During the test, with big after conversion
Small video frame exports all target pedestrian's prediction results by neural network as input, and output result is used threshold value
Filtering and non-maximum suppression are filtered, and are finally obtained the box for outlining pedestrian position, are achieved in the quick detection of pedestrian.
The advantages of the present invention over the prior art are:
(1) The present invention uses a single convolutional neural network to extract features from video frame images. In the pedestrian detection process, the video frame images are taken as input, and the boxes outlining the target pedestrian positions are output directly by the processing of the convolutional neural network. End-to-end methods are used in both training and testing, so the detection speed is fast. This method can be widely applied to complex scenes such as communities, hospitals, airports, stations, and schools, to detect target pedestrians in real time.
(2) The present invention extracts the features of video frame images using convolutional neural networks from deep learning, and trains with the training database and the special-scene database. In the training process, the convolutional neural network learns the postures of various pedestrian targets and continually extracts and learns high-dimensional features of pedestrian targets. Therefore, this method has strong generalization ability and strong robustness, can be applied to a variety of different scenes, and can detect target pedestrians with different appearance features.
Description of the drawings
Fig. 1 is the implementation flow chart of the present invention.
Specific embodiments
To make the object, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with the accompanying drawings and specific embodiments.
The present invention, a fast pedestrian detection method for public scenes based on convolutional neural networks, comprises: decomposing the video of a public scene into single frames, and converting the decomposed video frames into pictures of fixed pixel size using the bilinear interpolation method. The training of the pedestrian detection network is divided into two processes, pre-training and final training. In pre-training, the training database is used as training samples, training is based on a classification task, a cross-entropy function is defined as the loss function, and the whole network is trained through the loss function. In the final training process, most of the structure and the weights of the pre-training network are inherited, the network is modified, training is based on a regression task using the special-scene database, an improved mean square error function is defined as the loss function, and the whole network is trained by minimizing the loss function. During testing, the size-converted video frames are used as input, all target pedestrian prediction results are output by the neural network, and the output results are filtered by threshold filtering and non-maximum suppression, finally obtaining the boxes outlining the pedestrian positions, thereby achieving fast pedestrian detection.
As shown in Fig. 1, the present invention is implemented in the following steps:
Step 1: Read the training database pictures used for training, and stretch or compress their pixel size to A × B using the bilinear interpolation algorithm.
Step 2: Using the ImageNet database, pre-train the convolutional neural network on a classification task. Take the ImageNet database pictures with adjusted pixel size as input, compare the classification results output by the network with the labels of the input pictures, and compute the loss function. Minimize the loss function to pre-train the convolutional neural network.
Step 3: Read the pictures of the special-scene database, and stretch or compress their pixel size to A × B using the bilinear interpolation algorithm.
Step 4: Inherit the weights obtained by the pre-training network, change the structure of the end of the convolutional neural network, and, using the special-scene database, fine-tune the neural network for the pedestrian detection task. During training, take the images of the special scene as input, compare the output of the convolutional neural network with the labels of the corresponding pictures, and compute the loss function. Minimize the loss function to train the convolutional neural network.
Step 5: Read the pedestrian video of the public scene, decompose the video into single frames, and stretch or compress their pixel size to A × B using the bilinear interpolation algorithm.
Step 6: Use the trained network to perform target detection on the pedestrians in the pictures. Input the picture to be tested with adjusted pixel size into the trained network; the convolutional neural network extracts the features of the target objects in the image, and finally two fully connected layers output a tensor of dimension 7 × 7 × (2 × 5). The tensor represents the 98 prediction boxes that the convolutional neural network makes for the pedestrians to be detected.
Step 7: Set a threshold C_threshold for the confidence C, and filter the 98 prediction boxes generated by the convolutional neural network. Discard the prediction boxes whose confidence C is less than the given threshold C_threshold.
Step 8: Filter the prediction boxes with high overlap using non-maximum suppression. When the ratio of the intersection area to the union area of different prediction boxes exceeds the prescribed threshold IOU_threshold, retain only the prediction box with the maximum confidence C and suppress the other boxes. The retained prediction box data x, y, w, h, C are the spatial position coordinates and prediction confidences of the detected target pedestrians.
In the above Step 2, the process of pre-training the convolutional neural network is as follows:
Step i) Use the first 20 convolutional layers and corresponding pooling layers of the network shown in Fig. 1, then append one average pooling layer and a fully connected layer.
Step ii) Take as input of the convolutional neural network the tensor data of dimension A × B × 3 of the RGB space corresponding to the size-converted A × B pixels of the training database pictures, and output the probability y_i for each classification result.
Step iii) Compute the cross entropy between the network output probability y_i and the label probability y_i′, loss = −Σ_i y_i′ log(y_i), as the loss function; minimize the loss function loss to pre-train the network.
In the above Step 4, the method of finally training the convolutional neural network for pedestrian detection is as follows:
Step i) Retain the structure of the first 20 convolutional layers and corresponding pooling layers of the pre-training network and inherit their corresponding weights; append 4 convolutional layers and 2 fully connected layers with randomly initialized weights, keeping the network structure as shown in Fig. 1.
Step ii) The last fully connected layer of the network uses the linear activation function f(x) = x, while the other fully connected layers and convolutional layers use the leaky rectified linear unit activation function (Leaky ReLU): f(x) = max(x, 0.1x).
Step iii) The training samples are the size-converted special-scene database pictures and their corresponding labels. Take as input of the convolutional neural network the tensor data of dimension 448 × 448 × 3 of the RGB space corresponding to the 448 × 448 pixels of the pictures. The output of the neural network is a tensor of dimension 7 × 7 × (2 × 5), representing the 98 prediction boxes made for the measured targets. Each box has the 5 data x, y, w, h, C.
Step iv) Read the labels of the pictures in the special-scene database and search for the ground-truth box data corresponding to the pedestrian targets; for the prediction box data x, y, w, h, C output by the network, compute the corresponding label data x′, y′, w′, h′, C′ = P × IOU. In the computation, the whole picture is imagined to be uniformly divided into a 7 × 7 grid; if the center coordinate of the ground-truth box of a pedestrian target falls in some grid cell, a group of labels x′, y′, w′, h′, C′ = P × IOU is generated. x′, y′ are the coordinates of the center point of the ground-truth box, with values between 0 and 1: if the center point of the ground-truth box of the pedestrian is at the lower-left corner of the corresponding grid cell, the value is (0, 0); if it is at the upper-right corner of the cell, the value is (1, 1). w′, h′ are the length and width of the ground-truth box, with values between 0 and 1: a length or width corresponding to a pixel size of 0 has the value 0, and a length or width corresponding to a pixel size of 448 has the value 1. C′ = P × IOU, where P = 1 and IOU is the ratio of the intersection area to the union area of the ranges represented by the prediction box x, y, w, h and the ground-truth box x′, y′, w′, h′.
Step v) Compute the improved mean square error loss function between the predicted values x, y, w, h, C and the label values x′, y′, w′, h′, C′ = P × IOU, where λ_coord = 5 and λ_noobj = 0.5, i indicates the i-th cell of the 7 × 7 grid, and j indicates the j-th of the 2 prediction boxes of each cell. If the center coordinate of a pedestrian target falls in the i-th grid cell and the j-th prediction box of that cell has the maximum IOU with the ground-truth pedestrian box, then 1_ij^obj = 1 and 1_ij^noobj = 0; otherwise 1_ij^obj = 0 and 1_ij^noobj = 1. Minimize the loss function loss to train the network.
In conclusion the above is merely preferred embodiments of the present invention, being not intended to limit the scope of the present invention.
All within the spirits and principles of the present invention, any modification, equivalent replacement, improvement and so on should be included in of the invention
Within protection scope.
Claims (3)
1. A fast pedestrian detection method for complex public scenes based on deep learning, characterized by being realized in the following steps:
Step 1: Read the ImageNet database pictures used for training, and stretch or compress their pixel size to A × B using the bilinear interpolation algorithm.
Step 2: Using the training database, pre-train the convolutional neural network on a classification task. Take the training database pictures with adjusted pixel size as input, compare the classification results output by the network with the labels of the input pictures, and compute the loss function. Minimize the loss function to pre-train the convolutional neural network.
Step 3: Read the pictures of the special-scene database, and stretch or compress their pixel size to A × B using the bilinear interpolation algorithm.
Step 4: Inherit the weights obtained by the pre-training network, change the structure of the end of the convolutional neural network, and, using the special-scene database, fine-tune the neural network for the pedestrian detection task. During training, take the images of the special scene as input, compare the output of the convolutional neural network with the labels of the corresponding pictures, and compute the loss function. Minimize the loss function to train the convolutional neural network.
Step 5: Read the pedestrian video of the public scene, decompose the video into single frames, and stretch or compress their pixel size to A × B using the bilinear interpolation algorithm.
Step 6: Use the trained network to perform target detection on the pedestrians in the pictures. Input the picture to be tested with adjusted pixel size into the trained network; the convolutional neural network extracts the features of the target objects in the image, and finally two fully connected layers output a tensor of dimension 7 × 7 × (2 × 5). The tensor represents the 98 prediction boxes that the convolutional neural network makes for the pedestrians to be detected.
Step 7: Set a threshold C_threshold for the confidence C, and filter the 98 prediction boxes generated by the convolutional neural network. Discard the prediction boxes whose confidence C is less than the given threshold C_threshold.
Step 8: Filter the prediction boxes with high overlap using non-maximum suppression. When the ratio of the intersection area to the union area of different prediction boxes exceeds the prescribed threshold IOU_threshold, retain only the prediction box with the maximum confidence C and suppress the other boxes. The retained prediction box data x, y, w, h, C are the spatial position coordinates and prediction confidences of the detected target pedestrians.
2. The pedestrian detection method based on a single convolutional neural network according to claim 1, characterized in that in the above Step 2 the process of pre-training the convolutional neural network is as follows:
Step i) Use the first 20 convolutional layers and corresponding pooling layers of the network shown in Fig. 1, then append one average pooling layer and a fully connected layer.
Step ii) Take as input of the convolutional neural network the tensor data of dimension 224 × 224 × 3 of the RGB space corresponding to the size-converted 224 × 224 pixels of the training database pictures, and output the probability y_i for each classification result.
Step iii) Compute the cross entropy between the network output probability y_i and the label probability y_i′, loss = −Σ_i y_i′ log(y_i), as the loss function; minimize the loss function loss to pre-train the network.
3. The pedestrian detection method based on a single convolutional neural network according to claim 1, characterized in that in the above Step 4:
Step i) Retain the structure of the first 20 convolutional layers and corresponding pooling layers of the pre-training network and inherit their corresponding weights; append 4 convolutional layers and 2 fully connected layers with randomly initialized weights, keeping the network structure as shown in Fig. 1.
Step ii) The last fully connected layer of the network uses the linear activation function f(x) = x, while the other fully connected layers and convolutional layers use the leaky rectified linear unit activation function (Leaky ReLU): f(x) = max(x, 0.1x).
Step iii) The training samples are the size-converted special-scene database pictures and their corresponding labels. Take as input of the convolutional neural network the tensor data of dimension 448 × 448 × 3 of the RGB space corresponding to the 448 × 448 pixels of the pictures. The output of the neural network is a tensor of dimension 7 × 7 × (2 × 5), representing the 98 prediction boxes made for the measured targets. Each box has the 5 data x, y, w, h, C.
Step iv) Read the labels of the pictures in the special-scene database and search for the ground-truth box data corresponding to the pedestrian targets; for the prediction box data x, y, w, h, C output by the network, compute the corresponding label data x′, y′, w′, h′, C′ = P × IOU. In the computation, the whole picture is imagined to be uniformly divided into a 7 × 7 grid; if the center coordinate of the ground-truth box of a pedestrian target falls in some grid cell, a group of labels x′, y′, w′, h′, C′ = P × IOU is generated. x′, y′ are the coordinates of the center point of the ground-truth box, with values between 0 and 1: if the center point of the ground-truth box of the pedestrian is at the lower-left corner of the corresponding grid cell, the value is (0, 0); if it is at the upper-right corner of the cell, the value is (1, 1). w′, h′ are the length and width of the ground-truth box, with values between 0 and 1: a length or width corresponding to a pixel size of 0 has the value 0, and a length or width corresponding to a pixel size of 448 has the value 1. C′ = P × IOU, where P = 1 and IOU is the ratio of the intersection area to the union area of the ranges represented by the prediction box x, y, w, h and the ground-truth box x′, y′, w′, h′.
Step v) Compute the improved mean square error loss function between the predicted values x, y, w, h, C and the label values x′, y′, w′, h′, C′ = P × IOU, where λ_coord = 5 and λ_noobj = 0.5, i indicates the i-th cell of the 7 × 7 grid, and j indicates the j-th of the 2 prediction boxes of each cell. If the center coordinate of a pedestrian target falls in the i-th grid cell and the j-th prediction box of that cell has the maximum IOU with the ground-truth pedestrian box, then 1_ij^obj = 1 and 1_ij^noobj = 0; otherwise 1_ij^obj = 0 and 1_ij^noobj = 1. Minimize the loss function loss to train the network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810021283.6A CN109002752A (en) | 2018-01-08 | 2018-01-08 | A kind of complicated common scene rapid pedestrian detection method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810021283.6A CN109002752A (en) | 2018-01-08 | 2018-01-08 | A kind of complicated common scene rapid pedestrian detection method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109002752A true CN109002752A (en) | 2018-12-14 |
Family
ID=64573523
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810021283.6A Pending CN109002752A (en) | 2018-01-08 | 2018-01-08 | A kind of complicated common scene rapid pedestrian detection method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109002752A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109766790A (en) * | 2018-12-24 | 2019-05-17 | 重庆邮电大学 | A kind of pedestrian detection method based on self-adaptive features channel |
CN109829429A (en) * | 2019-01-31 | 2019-05-31 | 福州大学 | Security protection sensitive articles detection method under monitoring scene based on YOLOv3 |
CN110110844A (en) * | 2019-04-24 | 2019-08-09 | 西安电子科技大学 | Convolutional neural networks method for parallel processing based on OpenCL |
CN110321811A (en) * | 2019-06-17 | 2019-10-11 | 中国工程物理研究院电子工程研究所 | Depth is against the object detection method in the unmanned plane video of intensified learning |
CN110322509A (en) * | 2019-06-26 | 2019-10-11 | 重庆邮电大学 | Object localization method, system and computer equipment based on level Class Activation figure |
CN111160263A (en) * | 2019-12-30 | 2020-05-15 | 中国电子科技集团公司信息科学研究院 | Method and system for obtaining face recognition threshold |
CN112115862A (en) * | 2020-09-18 | 2020-12-22 | 广东机场白云信息科技有限公司 | Crowded scene pedestrian detection method combined with density estimation |
CN112580778A (en) * | 2020-11-25 | 2021-03-30 | 江苏集萃未来城市应用技术研究所有限公司 | Job worker mobile phone use detection method based on YOLOv5 and Pose-animation |
CN114245140A (en) * | 2021-11-30 | 2022-03-25 | 慧之安信息技术股份有限公司 | Code stream prediction method and device based on deep learning |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106022237A (en) * | 2016-05-13 | 2016-10-12 | 电子科技大学 | Pedestrian detection method based on end-to-end convolutional neural network |
CN106127815A (en) * | 2016-07-21 | 2016-11-16 | 广东工业大学 | Tracking method and system fusing convolutional neural networks |
CN106228575A (en) * | 2016-07-21 | 2016-12-14 | 广东工业大学 | Tracking method and system fusing convolutional neural networks and Bayesian filtering |
CN106650721A (en) * | 2016-12-28 | 2017-05-10 | 吴晓军 | Industrial character recognition method based on convolutional neural networks |
CN106650919A (en) * | 2016-12-23 | 2017-05-10 | 国家电网公司信息通信分公司 | Information system fault diagnosis method and device based on convolutional neural network |
CN106845549A (en) * | 2017-01-22 | 2017-06-13 | 珠海习悦信息技术有限公司 | Scene and target recognition method and device based on multi-task learning |
CN107203740A (en) * | 2017-04-24 | 2017-09-26 | 华侨大学 | Face age estimation method based on deep learning |
2018
- 2018-01-08: Application CN201810021283.6A filed in China (CN); published as CN109002752A, status: Pending
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109766790B (en) * | 2018-12-24 | 2022-08-23 | 重庆邮电大学 | Pedestrian detection method based on adaptive feature channels |
CN109766790A (en) * | 2018-12-24 | 2019-05-17 | 重庆邮电大学 | Pedestrian detection method based on adaptive feature channels |
CN109829429A (en) * | 2019-01-31 | 2019-05-31 | 福州大学 | YOLOv3-based detection method for security-sensitive articles in surveillance scenes |
CN110110844A (en) * | 2019-04-24 | 2019-08-09 | 西安电子科技大学 | OpenCL-based parallel processing method for convolutional neural networks |
CN110321811A (en) * | 2019-06-17 | 2019-10-11 | 中国工程物理研究院电子工程研究所 | Object detection method in UAV video based on deep inverse reinforcement learning |
CN110321811B (en) * | 2019-06-17 | 2023-05-02 | 中国工程物理研究院电子工程研究所 | Object detection method in UAV aerial video based on deep inverse reinforcement learning |
CN110322509A (en) * | 2019-06-26 | 2019-10-11 | 重庆邮电大学 | Object localization method, system and computer device based on hierarchical class activation maps |
CN111160263A (en) * | 2019-12-30 | 2020-05-15 | 中国电子科技集团公司信息科学研究院 | Method and system for obtaining face recognition threshold |
CN111160263B (en) * | 2019-12-30 | 2023-09-05 | 中国电子科技集团公司信息科学研究院 | Method and system for acquiring face recognition threshold |
CN112115862A (en) * | 2020-09-18 | 2020-12-22 | 广东机场白云信息科技有限公司 | Crowded scene pedestrian detection method combined with density estimation |
CN112115862B (en) * | 2020-09-18 | 2023-08-29 | 广东机场白云信息科技有限公司 | Congestion scene pedestrian detection method combined with density estimation |
CN112580778A (en) * | 2020-11-25 | 2021-03-30 | 江苏集萃未来城市应用技术研究所有限公司 | Worker mobile phone use detection method based on YOLOv5 and Pose-animation |
CN114245140A (en) * | 2021-11-30 | 2022-03-25 | 慧之安信息技术股份有限公司 | Code stream prediction method and device based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109002752A (en) | Rapid pedestrian detection method for complex common scenes based on deep learning | |
Tao et al. | Smoke detection based on deep convolutional neural networks | |
CN104077613B (en) | Crowd density estimation method based on cascaded multilevel convolutional neural network |
CN104978580B (en) | Insulator recognition method for UAV inspection of power transmission lines |
CN107358257B (en) | Incremental-learning-capable image classification training method for big-data scenarios |
CN110135319A (en) | Anomaly detection method and system |
CN106407903A (en) | Real-time human abnormal behavior recognition method based on multi-scale convolutional neural networks |
CN110147743A (en) | Real-time online pedestrian analysis and counting system and method for complex scenes |
CN110287960A (en) | Detection and recognition method for curved text in natural scene images |
CN107016357A (en) | Video pedestrian detection method based on time-domain convolutional neural networks |
CN110188835B (en) | Data-augmented pedestrian re-identification method based on a generative adversarial network model |
CN108921875A (en) | Real-time traffic flow detection and tracking method based on aerial photography data |
CN109241913A (en) | Ship detection method and system combining saliency detection and deep learning |
CN109902806A (en) | Method for determining object bounding boxes in noisy images based on convolutional neural networks |
CN108717528A (en) | Multi-strategy global crowd analysis method based on deep networks |
CN106127204A (en) | Multi-direction meter-reading region detection algorithm based on fully convolutional neural networks |
CN106683091A (en) | Target classification and pose detection method based on deep convolutional neural networks |
CN108334847A (en) | Face recognition method based on deep learning in real scenes |
CN106408015A (en) | Road fork identification and depth estimation method based on convolutional neural network | |
CN108764085A (en) | Crowd counting method based on generative adversarial networks |
CN104504395A (en) | Method and system for achieving classification of pedestrians and vehicles based on neural network | |
CN104298974A (en) | Human body behavior recognition method based on depth video sequence | |
CN110347870A (en) | Video summary generation method based on visual saliency detection and hierarchical clustering |
CN107038416A (en) | Pedestrian detection method based on binary-image improved HOG features |
CN103984963B (en) | Method for classifying high-resolution remote sensing image scenes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2018-12-14 |