CN109840498A - Real-time pedestrian detection method, neural network, and target detection layer - Google Patents
- Publication number: CN109840498A (application CN201910095995.7A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a real-time pedestrian detection method. The method mainly comprises: determining a default resolution; reading a video frame; determining the number of split blocks from a zoom factor; resizing the video frame; splitting the video frame; stacking the split sub-blocks and extracting features; predicting the coordinate parameters and pedestrian confidences of candidate pedestrian boxes; screening out the final pedestrian boxes; adjusting the zoom factor according to the pedestrian sizes in the current frame; and processing the next frame until the whole detection task is completed. The invention also discloses a neural network comprising 7, 8, or 9 convolutional layers, and a target detection layer that performs two functions: pedestrian target box coordinate prediction and target box confidence prediction. By adaptively scaling video frames through the zoom factor, the invention improves detection of small pedestrian targets in particular while maintaining detection accuracy and computation speed.
Description
Technical field
The present invention relates to the technical field of deep-learning video processing, and in particular to a real-time pedestrian detection method based on a deep convolutional neural network, together with a neural network and a target detection layer.
Background art
Object detection is an important computer vision technique. Within it, pedestrian detection has broad application prospects in frontier fields such as intelligent robotics, video surveillance, and autonomous driving, and has attracted attention from both academia and industry. Many pedestrian detection methods have been invented over the past decade, but numerous practical application problems remain unsolved, and pedestrian detection is still an extremely challenging task in computer vision.
Conventional pedestrian detection algorithms are mostly based on hand-designed features such as SIFT, SURF, and HOG. With the development of deep learning, and in particular of convolutional neural networks (CNNs), which are effective in image analysis tasks, deep learning algorithms began to be used for pedestrian recognition and detection. Cai et al. published "A unified multi-scale deep convolutional neural network for fast object detection" at the European Conference on Computer Vision (ECCV 2016), using different convolutional layers in a CNN to match images at different scales, combining detection at multiple scales with end-to-end training. Compared with conventional pedestrian detection algorithms this improves detection accuracy, but recognition is slow: it reaches only 15 frames/second on an NVIDIA Titan GPU, which is difficult to reconcile with real-time requirements. Du et al. published "Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection" at the IEEE Winter Conference on Applications of Computer Vision (WACV 2017), improving detection accuracy with multiple parallel CNNs; however, because of the excessive number of network parameters, detection is slow, reaching only about 3 frames/second on an NVIDIA Titan X GPU. Brazil et al. published "Illuminating pedestrians via simultaneous detection and segmentation" at the International Conference on Computer Vision (ICCV 2017), better handling pedestrian detection in crowds by sharing features between detection and segmentation networks; but the complex network structure consumes a large amount of storage, and detection speed also falls short of real-time requirements. Besides their high computational cost and poor real-time performance, the above methods miss many pedestrians that are far from the camera and small in size, making them unable to meet the detection requirements of practical application scenarios.
In practical applications, because backgrounds vary in complexity, pedestrians differ in appearance (different sizes or clothing), lighting and weather conditions change, and partial occlusion occurs, pedestrian detection methods based on deep learning generally require complex neural networks to reach the required detection accuracy, at the cost of increased algorithmic complexity and reduced real-time performance.
Summary of the invention
The purpose of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a real-time pedestrian detection method which, on the premise of guaranteed detection accuracy, achieves fast detection of pedestrian targets of various sizes through an adaptive zoom technique, improving the real-time performance of the algorithm.
The purpose of the present invention is realized by the following technical solution: a real-time pedestrian detection method that automatically splits video frames according to the size of the pedestrians in the video and performs a single iteration per frame, outputting pedestrian target boxes and pedestrian confidences for efficient detection. The method comprises the following steps:
Determine the default resolution Hd × Wd × 3 of the video accepted by the network, where Hd and Wd are the image height and width and 3 is the number of color channels;
Read the current frame I, of resolution H × W × 3;
Determine the number of split blocks B of the current frame I from the value of the zoom factor z;
According to z and B, resize the current frame I to H' × W';
Normalize the pixel values of the resized frame;
Split the normalized frame into B sub-images;
Arrange the sub-images of the current frame into dimensions (B, Hd, Wd, 3), extract features, and obtain the pedestrian target box coordinates and corresponding confidences from the feature map;
Screen valid boxes from the target boxes; the retained boxes and their pedestrian-class confidences serve as the pedestrian detection output;
Compute the average height Hped of all pedestrians detected in the current frame, with minimum and maximum thresholds Hθ_min and Hθ_max. If Hped < Hθ_min, increase the zoom factor z by 1; if Hped > Hθ_max, decrease z by 1; otherwise keep z unchanged. Repeat to detect the next frame until the whole video has been processed.
Preferably, the number of split blocks B of the current frame I is determined from the zoom factor z as follows: B = 1 when z = 0; B = 2 when z = 1; B = z² when z ≥ 2.
Further, for the first frame, the zoom factor z is initialized to 0.
Preferably, the current frame I is resized according to z and B as follows:
When B = 1, set H' = Hd and W' = Wd;
When B = 2, set H' = Hd and W' = 3Wd/2;
When B > 2 (B = z²), set H' = z·Hd and W' = z·Wd.
Further, the purpose of resizing the video frame is to ensure that, after the frame is split into B blocks, every block has resolution Hd × Wd and thus meets the input requirements of the neural network.
Preferably, normalizing the resized frame means dividing each pixel value in the resized frame by the pixel value upper limit, so that all values fall in the interval [0, 1].
Preferably, splitting the normalized frame into B sub-images covers the following 3 cases:
When B = 1, no split is made and the whole frame is input to the network model;
When B = 2, the frame is split vertically into two parts. Denoting the column and row coordinates of a pixel in the current frame I by x and y, one part is Il = I(x, y), 0 ≤ x < Wd, 0 ≤ y < Hd, and the other is Ir = I(x, y), W' − Wd ≤ x < W', 0 ≤ y < Hd;
When B = z², the frame is split into z rows and z columns, z² sub-images in total, each of size Hd × Wd.
Preferably, valid boxes are screened from the candidate target boxes as follows:
Set a confidence threshold θ and an upper limit kbox on the number of boxes. Among the Hout × Wout × 9 candidate boxes, retain only those with confidence not less than θ, keeping at most kbox of them, where Hout and Wout are the height and width of the output feature map;
The retained target boxes and their pedestrian-class confidences serve as the pedestrian detection output.
Preferably, when reading the current frame, if the video frame to be detected is a single-channel (e.g. grayscale) image, the channel is simply replicated to construct a 3-channel image.
A neural network comprises 7 convolutional layers, of which the 1st is a regular convolutional layer and every subsequent layer is a depthwise separable convolutional layer;
The 1st convolutional layer uses 32 filters of size 3 × 3, followed by a batch normalization (BN) layer and a rectified linear unit (ReLU) layer;
Each depthwise separable convolutional layer is a block composed of one group of layers, in order: a depthwise convolutional layer, a ReLU layer, a BN layer, a 1 × 1 convolutional layer, a ReLU layer, and a BN layer;
The 1st, 3rd, 5th, and 7th convolutional layers use convolution kernels with stride [2, 2] to downsample the feature map; the remaining convolutional layers use stride [1, 1];
The numbers of filters in the first 6 feature extraction layers are 32, 64, 128, 128, 256, and 256 in order; the remaining feature extraction layers have 512 filters; all filters are of size 3 × 3;
The final output of the neural network is a feature map of dimensions (B, Hout, Wout, 512), where Hout and Wout are the height and width of the output feature map.
Preferably, the designed neural network is a lightweight network: its total number of parameters is small, and storing the network structure takes only about 2.3 MB.
A target detection layer realizes two functions: pedestrian target box coordinate prediction and target box confidence prediction;
Pedestrian target box prediction is realized by 4 × 9 = 36 filters of size 1 × 1, predicting 9 candidate boxes for each cell of the feature map; each box is determined by four parameters: the leftmost coordinate xmin, the rightmost coordinate xmax, the top coordinate ymin, and the bottom coordinate ymax;
Target box confidence prediction is realized by 2 × 9 = 18 filters of size 1 × 1, computing for each of the 9 candidate boxes on each cell its classification confidence over two classes, pedestrian and background.
Compared with the prior art, the present invention has the following advantages and beneficial effects: the convolutional neural network used is a lightweight network with few parameters, so the algorithm runs fast, efficiently, and in real time. By adaptively scaling video frames through the zoom factor, it particularly improves detection of small pedestrian targets while maintaining detection accuracy and computation speed.
Brief description of the drawings
Fig. 1 is a flow chart of pedestrian detection according to an embodiment of the present invention.
Fig. 2 is a structure diagram of the convolutional neural network used in the embodiment of the present invention.
Fig. 3 is a schematic diagram of a detection result of the embodiment of the present invention.
Specific embodiment
For a better understanding of the technical solution of the present invention, an embodiment is described in detail below with reference to the accompanying drawings; embodiments of the present invention are not limited thereto.
Embodiment
On the premise of guaranteed detection and localization accuracy, this embodiment addresses the efficiency and speed problems of deep-learning-based pedestrian detection methods. The network model used by the present invention is very small, about 2.3 MB, and consumes few computing resources, yet achieves satisfactory detection speed and accuracy: after training on a small dataset and fine-tuning, it obtains 84.2% mAP (81% mAP without fine-tuning).
One embodiment step of pedestrian detection is described in detail as follows:
In the first step, the default resolution Hd × Wd × 3 of the video accepted by the network is determined; in this example Hd = 256 and Wd = 448.
In the second step, a frame is read from a camera or an existing video sequence. In this example, 1080 × 1920 × 3 color video is read directly from an IP camera; the current frame is denoted I.
In the third step, the number of split blocks B of the current frame I is determined from the value of the zoom factor z: B = 1 when z = 0, B = 2 when z = 1, and B = z² when z ≥ 2. In this example, for the first frame, z is initialized to 0, so B = 1. For subsequent frames, if z = 1 then B = 2, if z = 2 then B = 4, and so on.
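The mapping from zoom factor to block count given above can be sketched in Python as follows (the function name is illustrative; the piecewise rule follows the values quoted in this example):

```python
def split_count(z):
    """Number of sub-images B for zoom factor z:
    z=0 -> B=1, z=1 -> B=2, z>=2 -> B=z**2."""
    if z == 0:
        return 1
    if z == 1:
        return 2
    return z * z
```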
In the fourth step, the current frame I is resized to H' × W' according to z and B, covering the following 3 cases:
a. When B = 1, set H' = Hd, W' = Wd;
b. When B = 2, set H' = Hd, W' = 3Wd/2;
c. When B > 2 (B = z²), set H' = z·Hd, W' = z·Wd.
In this example, for the first frame, B = 1 and the 1st case applies: the frame is resized to 256 × 448. For subsequent frames, if B = 2 the frame is resized to 256 × 672, and if B = 4 (the case z = 2) it is resized to 512 × 896, and so on.
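The resizing rule can be checked against the figures quoted above with a small sketch (the function name and the default Hd, Wd arguments are taken from this example; the B = 2 width 3·Wd/2 matches the 256 × 672 case in the text):

```python
def adjusted_size(z, hd=256, wd=448):
    """Resized frame (H', W') for zoom factor z, using the example
    defaults Hd=256, Wd=448."""
    if z == 0:          # B = 1: whole frame at default resolution
        return hd, wd
    if z == 1:          # B = 2: widen to 3*Wd/2 for two overlapping halves
        return hd, wd * 3 // 2
    return z * hd, z * wd   # B = z**2: a z-by-z grid of Hd x Wd blocks
```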
In the fifth step, each pixel value in the resized frame is divided by the pixel value upper limit, normalizing it to the interval [0, 1].
In the sixth step, the normalized frame is split into B sub-images, covering the following 3 cases:
a. When B = 1, no split is made and the whole frame is input to the network model;
b. When B = 2, the frame is split vertically into left and right parts, Il = I(x, y), 0 ≤ x < Wd, 0 ≤ y < Hd, and Ir = I(x, y), W' − Wd ≤ x < W', 0 ≤ y < Hd;
c. When B = z², the frame is split into z rows and z columns, z² sub-images in total, each of size Hd × Wd.
In this example, for the first frame, B = 1 and the 1st case applies: the whole frame is input to the network model without splitting. For subsequent frames, if B = 2, the frame is split vertically into left and right parts, the left part spanning columns 0 to 447 and the right part columns 224 to 671. If B = 4 (the case z = 2), the frame is split evenly into 2 rows and 2 columns, 4 sub-images in total. Other cases follow by analogy.
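The three splitting cases above can be expressed as index ranges over the resized frame (a sketch; the function name is illustrative, and the overlapping B = 2 halves reproduce the column ranges 0–447 and 224–671 given in this example):

```python
def subimage_spans(z, hd=256, wd=448):
    """Return (row_start, row_end, col_start, col_end) for each sub-image
    of the resized frame, half-open ranges, for zoom factor z."""
    if z == 0:                       # B = 1: the whole frame
        return [(0, hd, 0, wd)]
    if z == 1:                       # B = 2: overlapping left/right halves
        wp = wd * 3 // 2             # resized width W' = 3*Wd/2
        return [(0, hd, 0, wd), (0, hd, wp - wd, wp)]
    # B = z**2: an even z-by-z grid of Hd x Wd blocks
    return [(r * hd, (r + 1) * hd, c * wd, (c + 1) * wd)
            for r in range(z) for c in range(z)]
```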
In the seventh step, the sub-images of the current frame are arranged into dimensions (B, Hd, Wd, 3) and input to the lightweight convolutional neural network for feature extraction. In this example, for the first frame, B = 1 and the input data shape is (1, 256, 448, 3); for subsequent frames it is (B, 256, 448, 3). The final output of the feature extraction network is a feature map of dimensions (B, Hout, Wout, 512); in this example Hout = 16 and Wout = 28.
In the eighth step, the feature map from the seventh step is input to the target detection layer to obtain the pedestrian target box coordinates and corresponding confidences. The target detection layer realizes two functions: pedestrian target box coordinate prediction and target box confidence prediction.
Pedestrian target box prediction predicts 9 candidate boxes for each cell of the feature map; each box is determined by four parameters: the leftmost coordinate xmin, the rightmost coordinate xmax, the top coordinate ymin, and the bottom coordinate ymax. In this embodiment, for the first frame, 16 × 28 × 9 = 4032 candidate boxes are predicted; for subsequent frames, if B = 2, 2 × 16 × 28 × 9 = 8064 candidate boxes are predicted, and so on.
Target box confidence prediction computes for each candidate box its classification confidence over two classes, pedestrian and background.
The structure of the convolutional neural network used in the embodiment (covering the seventh and eighth steps) is shown in Fig. 2.
9th step screens valid frame from candidate target frame.Confidence threshold value θ and target frame number upper limit k is setbox, from
Only retain the frame that confidence level is not less than θ in the candidate frame of previous step prediction, and retains quantity and be no more than kboxIt is a.The mesh retained
Mark frame and its corresponding classification confidence level can be used as the output result of pedestrian detection.In this embodiment, θ=0.01, k are setbox
=200.
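The screening rule can be sketched as follows (the text fixes only θ and the upper bound kbox; keeping the highest-confidence boxes when the limit is exceeded is an assumption):

```python
def screen_boxes(boxes, theta=0.01, k_box=200):
    """Keep boxes with confidence >= theta, at most k_box of them.
    `boxes` is a list of (confidence, (xmin, ymin, xmax, ymax)) pairs.
    Which boxes survive the k_box cap is not specified in the text;
    highest-confidence-first is an assumption."""
    kept = [bx for bx in boxes if bx[0] >= theta]
    kept.sort(key=lambda bx: bx[0], reverse=True)
    return kept[:k_box]
```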
Tenth step calculates the average height H for all pedestrians that present frame detectsped, and given threshold Hθ_minAnd Hθ_max,
If Hped< Hθ_min, then zoom factor z is increased by 1, if Hped> Hθ_max, then zoom factor z is reduced 1, other situations are then kept
Zoom factor is constant.Second step is repeated, next frame video is detected, until whole section of video detection finishes.In this embodiment, if
Determine Hθ_min=80, Hθ_max=150.
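The zoom-factor update can be sketched as follows, with the embodiment's thresholds as defaults (how an empty detection list is handled, and whether z is clamped at 0, are not specified in the text; both are assumptions here):

```python
def update_zoom(z, heights, h_min=80, h_max=150):
    """Adjust the zoom factor from the mean detected pedestrian height.
    Assumptions: no detections leave z unchanged, and z never drops
    below its initial value 0."""
    if not heights:
        return z
    h_ped = sum(heights) / len(heights)
    if h_ped < h_min:        # pedestrians too small: zoom in further
        return z + 1
    if h_ped > h_max:        # pedestrians large enough: zoom back out
        return max(z - 1, 0)
    return z
```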
A lightweight convolutional neural network comprises 7 convolutional layers, of which the 1st is a regular convolutional layer and every subsequent layer is a depthwise separable convolutional layer;
a. The 1st convolutional layer uses 32 filters of size 3 × 3, followed by a batch normalization (BN) layer and a rectified linear unit (ReLU) layer;
b. Each depthwise separable convolutional layer is a block composed of one group of layers, in order: a depthwise convolutional layer, a ReLU layer, a BN layer, a 1 × 1 convolutional layer, a ReLU layer, and a BN layer;
c. The 1st, 3rd, 5th, and 7th convolutional layers use convolution kernels with stride [2, 2] to downsample the feature map; the remaining convolutional layers use stride [1, 1];
d. The numbers of filters in the first 6 feature extraction layers are 32, 64, 128, 128, 256, and 256 in order; the remaining feature extraction layers have 512 filters; all filters are of size 3 × 3.
The final output of the neural network is a feature map of dimensions (B, Hout, Wout, 512), where Hout and Wout are the height and width of the output feature map.
The designed neural network is a lightweight network with a small total number of parameters; storing the network structure takes only about 2.3 MB. The number of convolutional layers can also be 8 or 9.
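As a consistency check on the downsampling described above, the following pure-Python sketch (the stride and channel lists restate the text; the helper name is illustrative, and exact halving assumes 'same'-style padding) confirms that four stride-2 layers reduce a 256 × 448 input to the 16 × 28 feature map of the embodiment:

```python
# Layer configuration restated from the description: 7 convolutional
# layers, stride [2,2] at layers 1, 3, 5, 7 and [1,1] elsewhere;
# filter counts 32, 64, 128, 128, 256, 256 then 512.
STRIDES = [2, 1, 2, 1, 2, 1, 2]
CHANNELS = [32, 64, 128, 128, 256, 256, 512]

def output_hw(h, w, strides=STRIDES):
    """Spatial size of the output feature map, assuming each stride-2
    convolution exactly halves height and width."""
    for s in strides:
        h, w = h // s, w // s
    return h, w
```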
A target detection layer realizes two functions: pedestrian target box coordinate prediction and target box confidence prediction;
Pedestrian target box prediction is realized by 4 × 9 = 36 filters of size 1 × 1, predicting 9 candidate boxes for each cell of the feature map; each box is determined by four parameters: the leftmost coordinate xmin, the rightmost coordinate xmax, the top coordinate ymin, and the bottom coordinate ymax;
Target box confidence prediction is realized by 2 × 9 = 18 filters of size 1 × 1, computing for each of the 9 candidate boxes on each cell its classification confidence over two classes, pedestrian and background.
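The filter counts above fix only the sizes of the detection layer's per-cell outputs: 36 coordinate values and 18 class scores. A minimal decoding sketch, assuming the values are laid out anchor by anchor and that the two-class confidence is a softmax over the scores (both are assumptions; the text does not specify the layout or the normalization):

```python
import math

def decode_cell(raw36, raw18):
    """Decode one feature-map cell: 36 values -> 9 boxes of
    (xmin, xmax, ymin, ymax); 18 values -> 9 softmaxed
    (pedestrian, background) confidence pairs."""
    boxes = [tuple(raw36[4 * i:4 * i + 4]) for i in range(9)]
    confs = []
    for i in range(9):
        p, b = raw18[2 * i], raw18[2 * i + 1]
        m = max(p, b)                       # stabilize the softmax
        ep, eb = math.exp(p - m), math.exp(b - m)
        confs.append((ep / (ep + eb), eb / (ep + eb)))
    return boxes, confs
```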
A detection result for one frame of the embodiment is shown in Fig. 3. The training of the convolutional neural network does not use adaptive scaling: 256 × 448 training images are fed directly into the network in batches of B = 32. The training objective combines a classification loss (softmax loss) for confidence prediction with a smooth L1 loss for pedestrian box positions.
The computer running the embodiment has a Core i5-6500 3.20 GHz × 4 CPU and a GTX 1080 Ti GPU, with Caffe (version 1.0.0-rc3) as the software environment. For video of size 256 × 448, detecting pedestrians in one frame takes only about 7 milliseconds, i.e. 142 frames per second on average, which satisfies most real-time monitoring requirements.
The above embodiment is a preferred embodiment of the present invention, but embodiments of the present invention are not limited by it; any changes, modifications, substitutions, combinations, or simplifications made without departing from the spirit and principles of the present invention are equivalent substitutions and are included within the scope of the present invention.
Claims (10)
1. A real-time pedestrian detection method, characterized in that video frames are automatically split according to the size of the pedestrians in the video and a single iteration is performed per frame, outputting pedestrian target boxes and pedestrian confidences; the method comprises the following steps:
determining the default resolution Hd × Wd × 3 of the video accepted by the network, where Hd and Wd are the image height and width and 3 is the number of color channels;
reading the current frame I, of resolution H × W × 3;
determining the number of split blocks B of the current frame I from the value of the zoom factor z;
resizing the current frame I to H' × W' according to z and B;
normalizing the pixel values of the resized frame;
splitting the normalized frame into B sub-images;
arranging the sub-images of the current frame into dimensions (B, Hd, Wd, 3), extracting features, and obtaining the pedestrian target box coordinates and corresponding confidences from the feature map;
screening valid boxes from the target boxes, the retained boxes and their pedestrian-class confidences serving as the pedestrian detection output;
computing the average height Hped of all pedestrians detected in the current frame, with minimum and maximum thresholds Hθ_min and Hθ_max; if Hped < Hθ_min, increasing the zoom factor z by 1; if Hped > Hθ_max, decreasing z by 1; otherwise keeping z unchanged; and repeating to detect the next frame until the whole video has been processed.
2. The real-time pedestrian detection method according to claim 1, characterized in that the number of split blocks B of the current frame I is determined as follows: B = 1 when z = 0; B = 2 when z = 1; B = z² when z ≥ 2.
3. The real-time pedestrian detection method according to claim 2, characterized in that, for the first frame, the zoom factor z is initialized to 0.
4. The real-time pedestrian detection method according to claim 1, characterized in that the current frame I is resized according to the zoom factor z and the number of split blocks B as follows:
when B = 1, setting H' = Hd, W' = Wd;
when B = 2, setting H' = Hd, W' = 3Wd/2;
when B > 2 (B = z²), setting H' = z·Hd, W' = z·Wd.
5. The real-time pedestrian detection method according to claim 1, characterized in that normalizing the resized frame means dividing each pixel value in the resized frame by the pixel value upper limit so that it falls in the interval [0, 1].
6. The real-time pedestrian detection method according to claim 1, characterized in that splitting the normalized frame into B sub-images covers the following 3 cases:
when B = 1, making no split and inputting the whole frame to the network model;
when B = 2, splitting the frame vertically into two parts and, denoting the column and row coordinates of a pixel in the current frame I by x and y, one part being Il = I(x, y), 0 ≤ x < Wd, 0 ≤ y < Hd, and the other Ir = I(x, y), W' − Wd ≤ x < W', 0 ≤ y < Hd;
when B = z², splitting the frame into z rows and z columns, z² sub-images in total, each of size Hd × Wd.
7. The real-time pedestrian detection method according to claim 1, characterized in that valid boxes are screened from the candidate target boxes as follows: setting a confidence threshold θ and an upper limit kbox on the number of boxes; among the Hout × Wout × 9 candidate boxes, retaining only those with confidence not less than θ, at most kbox of them, where Hout and Wout are the height and width of the output feature map; the retained target boxes and their pedestrian-class confidences serving as the pedestrian detection output.
8. The real-time pedestrian detection method according to claim 1, characterized in that, when reading the current frame, if the video frame to be detected is a single-channel image, the channel is replicated to construct a 3-channel image.
9. A neural network, characterized by comprising 7 convolutional layers, of which the 1st is a regular convolutional layer and every subsequent layer is a depthwise separable convolutional layer;
the 1st convolutional layer using 32 filters of size 3 × 3, followed by a batch normalization layer and a rectified linear unit layer;
each depthwise separable convolutional layer being a block composed of one group of layers, in order: a depthwise convolutional layer, a ReLU layer, a BN layer, a 1 × 1 convolutional layer, a ReLU layer, and a BN layer;
the 1st, 3rd, 5th, and 7th convolutional layers using convolution kernels with stride [2, 2] to downsample the feature map, the remaining convolutional layers using stride [1, 1];
the numbers of filters in the first 6 feature extraction layers being 32, 64, 128, 128, 256, and 256 in order, the remaining feature extraction layers having 512 filters, all filters being of size 3 × 3;
the final output of the neural network being a feature map of dimensions (B, Hout, Wout, 512), where Hout and Wout are the height and width of the output feature map.
10. A target detection layer, characterized in that the target detection layer realizes two functions: pedestrian target box coordinate prediction and target box confidence prediction;
pedestrian target box prediction being realized by 4 × 9 = 36 filters of size 1 × 1, predicting 9 candidate boxes for each cell of the feature map, each box being determined by four parameters: the leftmost coordinate xmin, the rightmost coordinate xmax, the top coordinate ymin, and the bottom coordinate ymax;
target box confidence prediction being realized by 2 × 9 = 18 filters of size 1 × 1, computing for each of the 9 candidate boxes on each cell its classification confidence over two classes, pedestrian and background.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910095995.7A CN109840498B (en) | 2019-01-31 | 2019-01-31 | Real-time pedestrian detection method, neural network and target detection layer |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109840498A true CN109840498A (en) | 2019-06-04 |
CN109840498B CN109840498B (en) | 2020-12-15 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104318216A (en) * | 2014-10-28 | 2015-01-28 | 宁波大学 | Method for recognizing and matching pedestrian targets across blind area in video surveillance |
CN106845621A (en) * | 2017-01-18 | 2017-06-13 | 山东大学 | Dense population number method of counting and system based on depth convolutional neural networks |
CN107316320A (en) * | 2017-06-19 | 2017-11-03 | 江西洪都航空工业集团有限责任公司 | The real-time pedestrian detecting system that a kind of use GPU accelerates |
CN107578021A (en) * | 2017-09-13 | 2018-01-12 | 北京文安智能技术股份有限公司 | Pedestrian detection method, apparatus and system based on deep learning network |
CN108960198A (en) * | 2018-07-28 | 2018-12-07 | 天津大学 | A kind of road traffic sign detection and recognition methods based on residual error SSD model |
CN109063559A (en) * | 2018-06-28 | 2018-12-21 | 东南大学 | A kind of pedestrian detection method returned based on improvement region |
CN109117717A (en) * | 2018-06-29 | 2019-01-01 | 广州烽火众智数字技术有限公司 | A kind of city pedestrian detection method |
Non-Patent Citations (2)
Title |
---|
Shaoqing Ren et al.: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence * |
Yuxin Peng et al.: "Object-Part Attention Model for Fine-Grained Image Classification", IEEE Transactions on Image Processing * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110728200A (en) * | 2019-09-23 | 2020-01-24 | 武汉大学 | Real-time pedestrian detection method and system based on deep learning |
CN113111770A (en) * | 2021-04-12 | 2021-07-13 | 杭州赛鲁班网络科技有限公司 | Video processing method, device, terminal and storage medium |
CN113111770B (en) * | 2021-04-12 | 2022-09-13 | 杭州赛鲁班网络科技有限公司 | Video processing method, device, terminal and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109840498B (en) | 2020-12-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Jia et al. | Detection and segmentation of overlapped fruits based on optimized mask R-CNN application in apple harvesting robot | |
CN113065558A (en) | Lightweight small target detection method combined with attention mechanism | |
CN104463117B (en) | A face recognition sample collection method and system based on video mode | |
CN109410168B (en) | Modeling method of convolutional neural network for determining sub-tile classes in an image | |
CN108898145A (en) | An image salient object detection method combining deep learning | |
CN108830196A (en) | Pedestrian detection method based on feature pyramid network | |
CN110929593B (en) | Real-time significance pedestrian detection method based on detail discrimination | |
CN110443763B (en) | Convolutional neural network-based image shadow removing method | |
CN107808132A (en) | A scene image classification method fusing topic models | |
CN108647694A (en) | Correlation filter target tracking method based on context awareness and automatic response | |
CN107909081A (en) | A method for rapid acquisition and rapid calibration of image datasets in deep learning | |
CN109377445A (en) | Model training method, and method, apparatus and electronic system for replacing image backgrounds | |
CN103262119A (en) | Method and system for segmenting an image | |
CN112150821A (en) | Lightweight vehicle detection model construction method, system and device | |
CN110705412A (en) | Video target detection method based on motion history image | |
CN109685045A (en) | A video-stream-based moving target tracking method and system | |
CN108038455A (en) | Deep-learning-based image recognition method for a bionic robotic peacock | |
CN113297956B (en) | Gesture recognition method and system based on vision | |
CN107273933A (en) | A construction method for an image tracking classifier and a face tracking method applying it | |
CN105825168A (en) | Golden snub-nosed monkey face detection and tracking algorithm based on S-TLD | |
CN110532959A (en) | Real-time violence detection system based on a dual-channel 3D convolutional neural network | |
CN110852199A (en) | Foreground extraction method based on double-frame coding and decoding model | |
CN114359245A (en) | Method for detecting surface defects of products in industrial scene | |
CN112163508A (en) | Character recognition method and system based on real scene and OCR terminal | |
CN109840498A (en) | A real-time pedestrian detection method, neural network, and target detection layer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20201215 Termination date: 20220131 |