CN106845430A

CN106845430A - Pedestrian detection and tracking based on acceleration region convolutional neural networks

Info

Publication number: CN106845430A
Application number: CN201710066312.6A
Authority: CN
Inventors: 叶国林; 孙韶媛; 高凯珺; 姚光顺
Original assignee: Donghua University
Current assignee: Donghua University; National Dong Hwa University
Priority date: 2017-02-06
Filing date: 2017-02-06
Publication date: 2017-06-13

Abstract

The present invention relates to a pedestrian recognition and tracking method based on an acceleration region convolutional neural network (Faster R-CNN). First, training and test data sets are collected at night by a robot carrying an infrared camera and preprocessed as required; the true target positions in all training and test pictures are then annotated and recorded in sample files. Next, the acceleration region convolutional neural network is built and trained on the training data set, and a non-maximum suppression algorithm is applied to the network output to obtain the final probabilities of belonging to a pedestrian region and the bounding boxes of those regions. The accuracy of the network is tested on the test data set to obtain a network model that meets the requirements. Pictures collected by the robot at night are input into the acceleration region convolutional neural network model, which outputs online and in real time the probabilities of belonging to a pedestrian region and the bounding boxes of the regions. The invention can effectively recognize pedestrians in infrared images and track pedestrian targets in infrared video in real time.

Description

Pedestrian detection and tracking based on acceleration region convolutional neural networks

Technical field

The present invention relates to a night-time robot pedestrian detection and tracking method based on an acceleration region convolutional neural network. The method belongs to the field of infrared night-vision image processing and enables a robot to detect and track pedestrians in real time at night.

Background technology

With the rapid development of robot technology and infrared imaging technology, the fields in which the two are combined are becoming broader. For example, a robot can perform pedestrian detection and tracking at night for reconnaissance and surveillance. For unmanned driving systems, a more advanced realization of robotics, pedestrians are likewise the main objects to be detected when operating at night. However, infrared images are grayscale images with no color information, few texture details, and a low signal-to-noise ratio, so pedestrian detection and tracking in infrared images is a very active research field.

In pedestrian tracking research, Yasuno et al. (M. Yasuno, S. Ryousuke, N. Yasuda. Pedestrian Detection and Tracking in Far Infrared Images [C]. In Proceedings of IEEE Conference on Intelligent Transportation Systems, 2005: 182-187.) track the position of the head by performing template matching within the tracking region. Dai et al. (X. Dai, F. Zheng, X. Liu. Layered representation for pedestrian detection and tracking in infrared imagery [J]. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, 3(1): 13-18.) argue that the limbs deform considerably while a person moves, which degrades tracking performance; to remove the influence of the limbs, they track only the head and torso. The infrared pedestrian tracking algorithms proposed so far all track one or several parts of the human body rather than the whole pedestrian.

For a long time, the most popular approach to pedestrian detection has been based on pedestrian feature extraction and machine learning. Wang Lei (Wang Lei. Research on pedestrian detection algorithms in infrared images [D]. Hefei University of Technology, 2015: 26-44.) first extracts features from positive and negative samples, i.e. pictures that contain pedestrians and pictures that do not, and trains a classifier. A complete image is then traversed with a sliding window, and the trained classifier decides for each window whether it contains a pedestrian, thereby achieving pedestrian detection. Although this method gives good detection results, it traverses the entire image with multi-scale sliding windows, which produces a large number of detection windows, and it extracts features from every window in turn. The amount of computation therefore rises sharply and the method is extremely slow.

In recent years, deep convolutional neural networks have developed rapidly and achieved great success in applications such as image classification, natural language processing, and object detection. Their strength lies in extracting image features and classifying them. To exploit this strength fully, Girshick et al. (R. Girshick, J. Donahue, et al. Rich feature hierarchies for accurate object detection and semantic segmentation [C]. IEEE Conference on Computer Vision and Pattern Recognition, 2014: 580-587.) proposed the region convolutional neural network (R-CNN) framework, which converts the object detection problem into a classification problem and achieves good detection results. The basic idea of the method is first to extract a number of candidate rectangular target regions from the image, then to extract target features from each candidate region with a deep convolutional network, and finally to train a classifier with a support vector machine to classify the candidate regions. The final object boundaries are refined from the classification scores of the regions using a non-maximum suppression algorithm. The candidate regions are no longer obtained with the earlier multi-scale sliding window; instead, a selective search algorithm based on hierarchical grouping and multiple similarity measures generates about 2000 multi-level candidate boxes.

In R-CNN, the convolutional network that extracts features and the classifier must be trained separately, so training consumes a great deal of time and storage space. Moreover, the training of the classifier is decoupled from the feature extraction network, which is unreasonable and impairs detection accuracy. Girshick (R. Girshick. Fast R-CNN. IEEE International Conference on Computer Vision, 2015.) therefore proposed the fast region convolutional neural network (Fast R-CNN) model, which merges feature extraction and classification into a single framework and improves both the training speed and the detection accuracy.

Although Fast R-CNN is an improvement, it still generates candidate regions separately with the selective search algorithm, which is very time-consuming. This is the critical reason the algorithm cannot run in real time.

Summary of the invention

The technical problem to be solved by the present invention is how to realize real-time pedestrian detection and tracking by a robot at night.

For a pedestrian tracking algorithm, if the pedestrian detection algorithm has a high recognition rate, all pedestrians in every frame of the infrared video can be detected, giving the position of the whole pedestrian rather than of a part of the body. If, in addition, the detection algorithm runs in real time, pedestrian tracking can be realized easily. The focus of the invention is therefore how to achieve pedestrian detection with a high recognition rate and real-time performance. Once such efficient pedestrian detection is achieved, pedestrian tracking follows naturally.

To solve the above technical problem, the technical solution of the present invention provides a pedestrian detection and tracking method based on an acceleration region convolutional neural network, characterized by comprising the following steps:

Step 1: Two groups of infrared pictures are collected at night by a robot carrying an infrared camera; one group serves as the training data set and the other as the test data set. All pictures in the training and test data sets are named according to a set convention, and picture name lists are produced for both data sets;

Step 2: The true target positions are annotated in all pictures of the training and test data sets; that is, every pedestrian target in every picture is marked with a box, and the number of pedestrians in the picture together with the 4 coordinates (top-left and bottom-right corners) of each pedestrian's bounding box is recorded in a sample file;

Step 3: The acceleration region convolutional neural network is built and trained using the pictures and sample files of the training data set. The acceleration region convolutional neural network comprises a region proposal network for extracting candidate regions and a convolutional neural network for pedestrian detection. The region proposal network selects a number of candidate regions and inputs them to the convolutional neural network, which outputs the score of each candidate region being a pedestrian and the coordinates of its bounding box after refinement. A non-maximum suppression algorithm is applied to the output of the convolutional neural network to obtain the final probabilities of belonging to a pedestrian region and the bounding boxes of the regions;

Step 4: The acceleration region convolutional neural network trained in step 3 is tested using the pictures and sample files of the test data set. If the error requirement is not met, the method returns to step 3 and retrains until it is met, yielding an acceleration region convolutional neural network model that satisfies the accuracy requirement;

Step 5: The acceleration region convolutional neural network model established in step 4 is used for online, real-time pedestrian detection and tracking by the robot at night; that is, the pictures collected by the robot at night are input into the model, which outputs online and in real time the probabilities of belonging to a pedestrian region and the bounding boxes of the regions.

Preferably, the acceleration region convolutional neural network consists of a series of convolution, activation, pooling, and fully connected operations and adopts the ZF architecture, which comprises a region proposal network and a target recognition network; the feature-map extraction parts of the region proposal network and the target recognition network use a parameter-sharing mechanism.

The present invention can be used by robots and unmanned vehicles to perform real-time pedestrian detection and tracking with an infrared camera at night when there is no light. The invention applies the acceleration region convolutional neural network to real-time pedestrian detection and tracking in infrared video. No other method is needed to generate candidate regions in advance and no pedestrian features are chosen by hand: after end-to-end training, an infrared picture is input directly and the pedestrian positions in the picture are output. The invention ensures the correctness and real-time performance of pedestrian detection and tracking in infrared video.

By using the acceleration region convolutional neural network, the method provided by the present invention needs neither separately generated candidate regions nor hand-chosen pedestrian features. Candidate region generation is also performed by a convolutional network, so the whole operation is end to end. The method markedly speeds up pedestrian recognition and improves its correctness.

Brief description of the drawings

Fig. 1 is a flow chart of night-vision image pedestrian recognition based on the acceleration region convolutional neural network;

Fig. 2 is a structural diagram of the acceleration region convolutional neural network.

Specific embodiment

The present invention is further described below with reference to a specific embodiment. It should be understood that the embodiment is intended only to illustrate the invention and not to limit its scope. It should further be understood that, after reading the teachings of the present invention, those skilled in the art may make various changes or modifications to it, and such equivalent forms likewise fall within the scope defined by the claims appended to this application.

A night-time robot pedestrian detection and tracking method based on an acceleration region convolutional neural network, comprising the following steps:

Step 1: Build the night-vision image training and test data sets. Experimental pictures are collected with the laboratory's own robot carrying an infrared camera; 2000 infrared pictures serve as the training data set and 200 infrared pictures as the test data set, each picture being 720 × 576 pixels. All pictures in the training and test data sets are renamed according to a set convention, and picture name lists are produced for both data sets.

Step 2: An annotation program is written in Python, and the true target positions are annotated manually in all training and test pictures; that is, every pedestrian target in every picture is marked with a box, and the number of pedestrians in the picture together with the 4 coordinates (top-left and bottom-right corners) of each pedestrian's bounding box is recorded in an .xml file.
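The sample file described in step 2 might be produced along the following lines. This is a minimal sketch, not the inventors' annotation program: the function name `build_annotation` and every tag name (`annotation`, `pedestrian_count`, `object`, `bndbox`, and so on) are assumptions chosen for illustration, since the text only states that the pedestrian count and the corner coordinates of each bounding box are stored in an .xml file.

```python
import xml.etree.ElementTree as ET


def build_annotation(filename, width, height, boxes):
    """Build an XML element describing the pedestrians in one picture.

    boxes is a list of (xmin, ymin, xmax, ymax) tuples, i.e. the top-left
    and bottom-right corners of each bounding box. All tag names here are
    hypothetical; the source does not specify the file layout.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    # Number of pedestrians in the picture.
    ET.SubElement(root, "pedestrian_count").text = str(len(boxes))
    for xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = "pedestrian"
        bnd = ET.SubElement(obj, "bndbox")
        for tag, value in (("xmin", xmin), ("ymin", ymin),
                           ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(bnd, tag).text = str(value)
    return root


# Example: a 720 x 576 picture containing two annotated pedestrians.
annotation = build_annotation(
    "000001.jpg", 720, 576,
    [(100, 120, 160, 300), (400, 90, 470, 310)],
)
xml_text = ET.tostring(annotation, encoding="unicode")
```

The resulting `xml_text` string can be written to disk with one file per picture, so that the picture name list from step 1 also indexes the sample files.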

Step 3: Build the acceleration region convolutional neural network and train it iteratively. The network is trained on the training set prepared in steps 1 and 2. It comprises parameter-shared convolutional layers, a region proposal network, and a detection convolutional network. The parameter-shared convolutional layers extract the feature map, which is fed to both the region proposal network and the detection convolutional network. The region proposal network learns to compute candidate regions, which are also input to the detection convolutional network. Finally, the detection convolutional network predicts and outputs the score of each candidate region being a pedestrian and the coordinates of its bounding box after refinement (regression).

Fig. 1 is a flow chart of night-vision image pedestrian recognition based on the acceleration region convolutional neural network. First, the true pedestrian positions in the infrared images are annotated and recorded in text files. The acceleration region convolutional neural network is then built, and the infrared training pictures, each with its corresponding true pedestrian position file, are fed to the network for learning. After a certain number of learning iterations, the model parameters of the network are obtained. A test image is then input, and the network performs pedestrian recognition on it according to the trained model parameters, finally producing the bounding boxes of all pedestrians in the test night-vision image.

Fig. 2 is a structural diagram of the acceleration region convolutional neural network. The network mainly comprises three parts: the parameter-shared convolutional layers, the region proposal network, and the detection convolutional network. The parameter-shared convolutional layers extract the feature map, which is fed to both the region proposal network and the detection convolutional network. The region proposal network learns to compute candidate regions, which are also input to the detection convolutional network. Finally, the detection convolutional network predicts the regions that may be pedestrian positions and computes the output loss against the actual pedestrian positions, which is used to update the network parameters.

The acceleration region convolutional neural network used in the present invention consists of a series of convolution, activation, pooling, and fully connected operations and adopts the ZF architecture, which comprises the region proposal network (RPN) and the target recognition network Fast R-CNN; the feature-map extraction parts of the RPN and Fast R-CNN networks are convolutional layers that use a parameter-sharing mechanism.

The present invention uses 5 convolutional layers for feature-map extraction. Let the convolutional network be f with parameters θ; the mathematical expression of f is:

$$f(X;\theta) = W_L H_{L-1}, \qquad H_l = \mathrm{pool}\big(\mathrm{relu}(W_l H_{l-1} + b_l)\big)$$

where $H_l$ is the output of the hidden units of layer $l$, $b_l$ is the bias of layer $l$, and $W_l$ is the weight of layer $l$; $b_l$ and $W_l$ together make up the trainable parameters θ, and $l$ is an integer not less than 1. pool() denotes the pooling operation, which merges the feature points in a small neighborhood into a new feature, so that the number of features and parameters is reduced; the pooling unit is also translation invariant. Pooling methods mainly include average pooling and max pooling, and the present invention mainly uses max pooling. relu() denotes a nonlinear transformation of the feature map that lets the wanted information pass and filters out the unwanted information. The last convolutional layer has 256 convolution kernels, so there are 256 feature maps and the feature dimension is 256; each feature map is about 40 × 60 in size. These feature maps are input to the region proposal network and to the target recognition convolutional network. The parameter configuration of the feature-extraction convolutional layers is shown in Table 1.
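The relu() and pool() operations described above can be illustrated on a single small feature map. The sketch below uses NumPy and assumes non-overlapping 2 × 2 max pooling for simplicity; the actual window size and stride used in the patented network are not stated in this text, so `max_pool` and its parameter `k` are illustrative only.

```python
import numpy as np


def relu(x):
    """Pass positive activations through and set negative ones to zero."""
    return np.maximum(x, 0.0)


def max_pool(x, k=2):
    """Non-overlapping k x k max pooling over an (H, W) feature map.

    Rows and columns that do not fill a complete window are dropped.
    The window size k is an assumption made for this sketch.
    """
    h, w = x.shape
    h2, w2 = h // k, w // k
    trimmed = x[:h2 * k, :w2 * k]
    # Split each axis into (blocks, k) and take the maximum inside each block.
    return trimmed.reshape(h2, k, w2, k).max(axis=(1, 3))


feature_map = np.array([[ 1.0, -2.0,  3.0,  0.5],
                        [-1.0,  4.0, -3.0,  2.0],
                        [ 0.0, -5.0,  6.0, -1.0],
                        [ 2.0,  1.0, -2.0, -4.0]])

pooled = max_pool(relu(feature_map))
# pooled is [[4., 3.], [2., 6.]]: each 2 x 2 neighborhood collapses to its maximum.
```

The 4 × 4 input shrinks to 2 × 2, which is the reduction in feature count that the text attributes to pooling.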

Table 1. Parameter configuration of the feature-extraction convolutional layers

In the region proposal network, a 3 × 3 sliding window is slid over the feature map. At each window position, candidate regions are predicted for 3 scales of the input image (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1), so each sliding position has 9 candidate regions and one image yields roughly 20,000 (40 × 60 × 9 = 21,600) candidate regions. The convolutional layer is followed by two sibling fully connected branches: a classification layer (cls-layer) that outputs 2 scores for judging whether a candidate region is target or background, and a bounding-box regression layer (reg-layer) that outputs 4 values for fine-tuning the boundary of the candidate region. With 9 candidate regions per position, the fully connected layers therefore output (2 + 4) × 9 values in total. Although the region proposal network selects about 2000 candidate regions, the invention ranks them by score and feeds only the top 300 into the target recognition convolutional network, which speeds up processing.
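The 9 candidate regions per sliding position and the product 40 × 60 × 9 quoted above can be checked with a short NumPy sketch. The function `make_anchors` is hypothetical, and the feature stride of 16 pixels and the convention that each ratio means height divided by width are assumptions; the text gives only the scales, the aspect ratios, and the approximate feature-map size.

```python
import numpy as np


def make_anchors(feat_h=40, feat_w=60, stride=16,
                 scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return an (feat_h * feat_w * 9, 4) array of (x1, y1, x2, y2) boxes.

    stride (input pixels per feature-map cell) is an assumed value.
    Each ratio is interpreted as height / width, and each box keeps an
    area of scale * scale.
    """
    shapes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            shapes.append((w, h))
    half = np.array(shapes) / 2.0                           # (9, 2)

    # Centers of the feature-map cells, expressed in input-image pixels.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cxg, cyg = np.meshgrid(cx, cy)
    centers = np.stack([cxg.ravel(), cyg.ravel()], axis=1)  # (2400, 2)

    top_left = centers[:, None, :] - half[None, :, :]       # (2400, 9, 2)
    bottom_right = centers[:, None, :] + half[None, :, :]
    return np.concatenate([top_left, bottom_right], axis=2).reshape(-1, 4)


anchors = make_anchors()
# anchors.shape is (21600, 4): 40 * 60 positions with 9 boxes each.
```

Many of these boxes extend beyond the image border, which is one reason only a scored subset is passed on to the recognition network.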

The candidate boxes are input into the target recognition convolutional network, which uses the Fast R-CNN network. In addition to the parameter-shared convolutional layers that extract features, it comprises, in sequence, two fully connected layers of 4096 units each with activation layers, a classification layer with 2 outputs, a bounding-box regression layer with 4 outputs, and a loss layer.

When the region proposal network is trained, a binary label is assigned to each candidate region. A positive label is assigned to two kinds of candidate regions: (1) the candidate region that has the highest IoU (intersection over union) overlap with a given ground-truth (GT) bounding box, even if that IoU is below 0.7; and (2) candidate regions whose IoU overlap with any GT bounding box exceeds 0.7. One GT bounding box may therefore assign positive labels to several candidate regions. A negative label is assigned to candidate regions whose IoU with all GT bounding boxes is below 0.3. Candidate regions that are neither positive nor negative do not contribute to the training objective.
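The labeling rule above can be sketched in NumPy as follows. The function names `iou_matrix` and `assign_labels`, the (x1, y1, x2, y2) box format, and the use of -1 to mark ignored candidate regions are choices made for this illustration, not details taken from the source.

```python
import numpy as np


def iou_matrix(anchors, gt_boxes):
    """IoU between every anchor (A, 4) and every GT box (G, 4), as (A, G)."""
    a = anchors[:, None, :]
    g = gt_boxes[None, :, :]
    iw = np.clip(np.minimum(a[..., 2], g[..., 2])
                 - np.maximum(a[..., 0], g[..., 0]), 0, None)
    ih = np.clip(np.minimum(a[..., 3], g[..., 3])
                 - np.maximum(a[..., 1], g[..., 1]), 0, None)
    inter = iw * ih
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_a + area_g - inter)


def assign_labels(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for each anchor."""
    iou = iou_matrix(anchors, gt_boxes)
    labels = np.full(len(anchors), -1, dtype=int)
    best_per_anchor = iou.max(axis=1)
    labels[best_per_anchor < neg_thr] = 0   # below 0.3 with every GT box
    labels[best_per_anchor > pos_thr] = 1   # rule (2): above 0.7 with some GT box
    # Rule (1): the best-matching anchor of each GT box is positive,
    # even when its IoU is below pos_thr (skipped if there is no overlap).
    best_anchor = iou.argmax(axis=0)
    best_iou = iou.max(axis=0)
    labels[best_anchor[best_iou > 0]] = 1
    return labels


gt = np.array([[0.0, 0.0, 10.0, 10.0], [100.0, 100.0, 110.0, 110.0]])
candidates = np.array([
    [0.0, 0.0, 10.0, 10.0],        # IoU 1.0 with the first GT box
    [0.0, 0.0, 10.0, 5.0],         # IoU 0.5: neither positive nor negative
    [20.0, 20.0, 30.0, 30.0],      # no overlap: negative
    [0.0, 0.0, 10.0, 8.0],         # IoU 0.8: positive by rule (2)
    [100.0, 100.0, 110.0, 116.0],  # IoU 0.625, but best match of the second GT box
])
labels = assign_labels(candidates, gt)
# labels is [1, -1, 0, 1, 1]
```

The last candidate shows why rule (1) exists: without it, the second GT box would have no positive candidate region at all.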

As in Fast R-CNN, training of the region proposal network follows a multi-task loss and minimizes the objective function. The loss function for one image is defined as:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\,L_{reg}(t_i, t_i^*)$$

where $i$ is the index of a candidate region in a training batch (mini-batch) and $p_i$ is the predicted probability that the $i$-th candidate region is a target. The GT label $p_i^*$ is 1 if the candidate region is positive and 0 otherwise. $t_i = (t_x, t_y, t_w, t_h)$ is a vector of the 4 parameterized coordinates of the predicted bounding box, and $t_i^* = (t_x^*, t_y^*, t_w^*, t_h^*)$ is the coordinate vector of the GT bounding box corresponding to a positive candidate region. λ is a balancing weight, set to 10 in the present invention; $N_{cls}$ is the size of the mini-batch, i.e. 256; $N_{reg}$ is the number of candidate regions, i.e. about 2400.

The classification loss $L_{cls}$ is the log loss over two classes (target and non-target):

$$L_{cls}(p_i, p_i^*) = -\log\big[p_i^* p_i + (1 - p_i^*)(1 - p_i)\big]$$

The regression loss $L_{reg}$ is computed as

$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$$

where R is the robust loss function (smooth $L_1$), defined as:

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

The factor $p_i^*$ in the term $p_i^* L_{reg}$ means that only positive candidate regions ($p_i^* = 1$) incur a regression loss; in all other cases ($p_i^* = 0$) there is none.
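The multi-task loss described above, a two-class log loss plus a smooth L1 regression term that counts only for positive candidate regions, can be written out in NumPy as a rough check of the arithmetic. The function `rpn_loss` and its argument names are hypothetical, and the formula assumed here is the standard Faster R-CNN form with the weights stated in the text (λ = 10, N_cls = 256, N_reg ≈ 2400) as defaults.

```python
import numpy as np


def smooth_l1(x):
    """Robust loss: quadratic near zero, linear for |x| >= 1."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)


def rpn_loss(p, p_star, t, t_star, lam=10.0, n_cls=256, n_reg=2400):
    """Multi-task loss for one image.

    p       (N,)   predicted probability that each candidate is a target
    p_star  (N,)   GT label, 1 for positive and 0 for negative candidates
    t       (N, 4) predicted parameterized box coordinates
    t_star  (N, 4) GT parameterized box coordinates
    """
    eps = 1e-12
    p = np.clip(p, eps, 1.0 - eps)
    # With p_star in {0, 1} this equals -log[p* p + (1 - p*)(1 - p)].
    l_cls = -(p_star * np.log(p) + (1.0 - p_star) * np.log(1.0 - p))
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    # The factor p_star switches the regression term off for negatives.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg


p = np.array([0.5, 0.5])
p_star = np.array([1.0, 0.0])
t = np.array([[0.5, 0.0, 0.0, 2.0], [9.0, 9.0, 9.0, 9.0]])
t_star = np.zeros((2, 4))
loss = rpn_loss(p, p_star, t, t_star, lam=10.0, n_cls=2, n_reg=1)
```

In this toy call the second candidate is negative, so its large coordinate error contributes nothing to the regression term, which is the behavior the factor on the regression loss is meant to enforce.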

For regression, the present invention uses 4 coordinates:

$$t_x = (x - x_a)/w_a,\quad t_y = (y - y_a)/h_a,\quad t_w = \log(w/w_a),\quad t_h = \log(h/h_a),$$

$$t_x^* = (x^* - x_a)/w_a,\quad t_y^* = (y^* - y_a)/h_a,\quad t_w^* = \log(w^*/w_a),\quad t_h^* = \log(h^*/h_a),$$

where $(t_x, t_y, t_w, t_h)$ is the vector of 4 parameterized coordinates of the predicted bounding box and $(t_x^*, t_y^*, t_w^*, t_h^*)$ is the vector of 4 parameterized coordinates of the GT bounding box corresponding to a positive candidate region; these two vectors are used to compute the loss. $x, y, w, h$ denote the center coordinates $(x, y)$, width, and height of the predicted bounding box; $x_a, y_a, w_a, h_a$ denote the center coordinates $(x_a, y_a)$, width, and height of the candidate region bounding box; $x^*, y^*, w^*, h^*$ denote the center coordinates $(x^*, y^*)$, width, and height of the GT bounding box. This can be understood as bounding-box regression from the candidate region bounding box to a nearby GT bounding box.
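The parameterization above maps a box to offsets relative to a candidate region box, and it can be inverted to recover the box from predicted offsets. The sketch below illustrates both directions; the helper names `encode` and `decode` are hypothetical, and boxes are given as (center x, center y, width, height).

```python
import numpy as np


def encode(box, anchor):
    """Parameterize a box relative to a candidate region (anchor) box."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa,
                     (y - ya) / ha,
                     np.log(w / wa),
                     np.log(h / ha)])


def decode(t, anchor):
    """Invert encode(): recover (x, y, w, h) from the 4 coordinates."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([xa + tx * wa,
                     ya + ty * ha,
                     wa * np.exp(tw),
                     ha * np.exp(th)])


anchor_box = (100.0, 100.0, 50.0, 100.0)
gt_box = (110.0, 90.0, 100.0, 50.0)

t_star = encode(gt_box, anchor_box)     # regression target for this anchor
recovered = decode(t_star, anchor_box)  # equals gt_box up to rounding
```

Because the center offsets are divided by the anchor size and the sizes are compared on a log scale, the same targets describe the same relative correction regardless of how large the candidate region is.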

The above is the loss function of the region proposal network; the target recognition convolutional network still uses Fast R-CNN's own loss function. The whole network is trained by alternating training, namely:

(1) The region proposal network described above is initialized with an ImageNet-pretrained network model and fine-tuned end to end for candidate-box extraction. This stage runs for 80000 iterations.

(2) Using the candidate regions generated in stage (1), a separate detection network is trained by Fast R-CNN. The Fast R-CNN detection network is likewise initialized with the ImageNet-pretrained model; at this point the two networks do not yet share convolutional layers. This stage runs for 40000 iterations.

(3) The detection network Fast R-CNN is then used to initialize training of the region proposal network, but the shared convolutional layers are fixed and only the layers unique to the region proposal network are fine-tuned; the two networks now share convolutional layers. This stage runs for 80000 iterations.

(4) With the shared convolutional layers kept fixed, the other layers of Fast R-CNN are fine-tuned. The two networks thus share the same convolutional layers and form one unified network. This stage runs for 40000 iterations.

The network parameters are obtained through the iterative learning described above.
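The four-stage alternating schedule can be summarized as plain data, which makes the iteration budget easy to inspect. The dictionary keys and the wording of the initialization sources below are illustrative labels only; the iteration counts and the order of the stages are taken from the text.

```python
# Alternating training schedule, one entry per stage (1) to (4).
SCHEDULE = [
    {"network": "RPN", "init_from": "ImageNet-pretrained model",
     "shared_conv_fixed": False, "iterations": 80000},
    {"network": "Fast R-CNN", "init_from": "ImageNet-pretrained model",
     "shared_conv_fixed": False, "iterations": 40000},
    {"network": "RPN", "init_from": "Fast R-CNN from stage (2)",
     "shared_conv_fixed": True, "iterations": 80000},
    {"network": "Fast R-CNN", "init_from": "shared layers from stage (3)",
     "shared_conv_fixed": True, "iterations": 40000},
]

total_iterations = sum(stage["iterations"] for stage in SCHEDULE)
# total_iterations is 240000 across the four stages.
```

Only the last two stages keep the shared convolutional layers fixed, which is what lets the two networks end up with identical feature-extraction layers.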

With the trained model parameters, inputting one infrared picture yields, for 300 candidate regions, the probability of being a target and the boundary coordinates; the non-maximum suppression algorithm is then applied to obtain the final probabilities of belonging to a pedestrian region and the bounding boxes of the regions.
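Non-maximum suppression, as used above to reduce the scored candidate regions to final pedestrian boxes, can be sketched as a greedy procedure in NumPy. The function name `nms` and the IoU threshold of 0.3 are assumptions for illustration; the text does not state the threshold used.

```python
import numpy as np


def nms(boxes, scores, iou_thr=0.3):
    """Greedy non-maximum suppression.

    boxes  (N, 4) array of (x1, y1, x2, y2)
    scores (N,)   pedestrian probabilities
    Returns the indices of the boxes that are kept, highest score first.
    """
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # Overlap of the current best box with every remaining box.
        iw = np.clip(np.minimum(boxes[i, 2], boxes[rest, 2])
                     - np.maximum(boxes[i, 0], boxes[rest, 0]), 0, None)
        ih = np.clip(np.minimum(boxes[i, 3], boxes[rest, 3])
                     - np.maximum(boxes[i, 1], boxes[rest, 1]), 0, None)
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop boxes that overlap the kept box too strongly.
        order = rest[iou <= iou_thr]
    return keep


boxes = np.array([[0.0, 0.0, 10.0, 10.0],
                  [1.0, 1.0, 11.0, 11.0],
                  [50.0, 50.0, 60.0, 60.0]])
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
# kept is [0, 2]: the second box overlaps the first (IoU about 0.68) and is suppressed.
```

A lower threshold merges more aggressively, which suits isolated pedestrians but can suppress one of two people standing close together.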

Step 4: The acceleration region convolutional neural network trained in step 3 is tested using the pictures and sample files of the test data set; it meets the error requirement, yielding an acceleration region convolutional neural network model that satisfies the accuracy requirement.

Experiments show that the acceleration region convolutional neural network used in the present invention recognizes pedestrians in night-vision images well, with a high recognition rate and good real-time performance.

Claims

1. A pedestrian detection and tracking method based on an acceleration region convolutional neural network, characterized by comprising the following steps:

Step 1: Two groups of infrared pictures are collected at night by a robot carrying an infrared camera; one group serves as the training data set and the other as the test data set. All pictures in the training and test data sets are named according to a set convention, and picture name lists are produced for both data sets;

Step 2: The true target positions are annotated in all pictures of the training and test data sets; that is, every pedestrian target in every picture is marked with a box, and the number of pedestrians in the picture together with the 4 coordinates (top-left and bottom-right corners) of each pedestrian's bounding box is recorded in a sample file;

Step 3: The acceleration region convolutional neural network is built and trained using the pictures and sample files of the training data set. The acceleration region convolutional neural network comprises a region proposal network for extracting candidate regions and a convolutional neural network for pedestrian detection. The region proposal network selects a number of candidate regions and inputs them to the convolutional neural network, which outputs the score of each candidate region being a pedestrian and the coordinates of its bounding box after refinement. A non-maximum suppression algorithm is applied to the output of the convolutional neural network to obtain the final probabilities of belonging to a pedestrian region and the bounding boxes of the regions;

Step 4: The acceleration region convolutional neural network trained in step 3 is tested using the pictures and sample files of the test data set. If the error requirement is not met, the method returns to step 3 and retrains until it is met, yielding an acceleration region convolutional neural network model that satisfies the accuracy requirement;

Step 5: The acceleration region convolutional neural network model established in step 4 is used for online, real-time pedestrian detection and tracking by the robot at night; that is, the pictures collected by the robot at night are input into the model, which outputs online and in real time the probabilities of belonging to a pedestrian region and the bounding boxes of the regions.

2. The pedestrian detection and tracking method based on an acceleration region convolutional neural network as claimed in claim 1, characterized in that: the acceleration region convolutional neural network consists of a series of convolution, activation, pooling, and fully connected operations and adopts the ZF architecture, which comprises a region proposal network and a target recognition network; the feature-map extraction parts of the region proposal network and the target recognition network use a parameter-sharing mechanism.