CN109214241A - Pedestrian detection method based on deep learning algorithm - Google Patents

Pedestrian detection method based on deep learning algorithm

Info

Publication number
CN109214241A
Authority
CN
China
Prior art keywords
candidate frame
loss
scale
analysis
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710531902.1A
Other languages
Chinese (zh)
Inventor
葛水英
杨东明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Science Library Chinese Academy Of Sciences
Original Assignee
National Science Library Chinese Academy Of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Science Library, Chinese Academy of Sciences
Priority to CN201710531902.1A
Publication of CN109214241A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian detection method based on a deep learning algorithm. The method comprises: step 1, extracting features from the input image using a base network, then extracting multi-scale feature maps using the additional multi-scale feature layers of the network; step 2, generating fixed candidate boxes at each location of the feature maps according to different scales and aspect ratios; step 3, applying a 3*3 convolutional classifier on the feature maps so that each candidate box regresses 1 pedestrian confidence score and 4 position offset values; step 4, computing the loss according to the training objective function to obtain the final trained model. The trained model performs pedestrian detection: after the confidence scores and locations of the candidate boxes are obtained, non-maximum suppression selects the final candidate boxes as the detection output. The invention effectively improves the performance and real-time behavior of pedestrian detection, reaching a detection speed above 60 frames per second (on a Titan X GPU) and an average precision of about 75%.

Description

Pedestrian detection method based on deep learning algorithm
Technical field
Embodiments of the present invention relate to the field of computer image processing, and in particular to methods for image detection and target recognition.
Background art
Pedestrian detection is an important branch of the target recognition field and a research hotspot in recent years. Pedestrian detection technology is the foundation and guarantee of research on pedestrian tracking, behavior analysis, gait analysis, and pedestrian recognition, and has broad application prospects in fields such as driver assistance, intelligent surveillance, and advanced human-machine interfaces. Conventional pedestrian detection methods perform pedestrian detection with hand-crafted features, later integrating these features into deformation models and occlusion models; however, these methods are limited by the low level of the hand-crafted features and cannot solve problems such as pedestrian occlusion. Since 2012, convolutional neural networks (CNNs) have been applied to target detection and introduced into pedestrian detection, breaking free of the constraints of hand-crafted features; however, because of their complicated network structures and heavy computation, these methods all have poor real-time performance.
The present invention proposes a pedestrian detection network model based on deep learning whose detection accuracy and detection speed are substantially improved over, and better balanced than, existing methods. At the same time, detection accuracy can be guaranteed even when the input image has low resolution. The end-to-end design of the whole model makes the training and detection processes simple and completable in a single pass.
Summary of the invention
The main purpose of the embodiments of the present invention is to provide a pedestrian detection method based on a deep learning algorithm.
To achieve the above goal, according to an aspect of the invention, the following technical scheme is provided:
A pedestrian detection method based on a deep learning algorithm, the method comprising at least:
Step 1: normalizing the input image to a unified size as input, and extracting the base feature map from the input image using the base network part of the designed deep network; then extracting multi-scale feature maps from the resulting base feature map using the additional multi-scale feature layers of the designed deep network;
Step 2: at each feature map location of the multi-scale feature maps of step 1, generating a corresponding series of different fixed candidate boxes (default boxes) according to different size scales and aspect ratios;
Step 3: applying a 3*3 convolutional classifier on the multi-scale feature maps of step 1, each candidate box at each location regressing 1 pedestrian confidence score (score) and 4 position offset values (offsets); during training, a candidate box whose Jaccard overlap with any ground truth exceeds a certain threshold is taken as a positive sample, the others are sorted by confidence, and 3 times the number of positive samples are chosen as negative samples;
Step 4: computing the loss according to the training objective function, back-propagating, and iterating to obtain the final trained model; the trained model performs pedestrian detection, wherein an image passes through steps 1 to 3 in order, and after the confidence scores and locations of the candidate boxes are obtained, non-maximum suppression selects the final candidate boxes as the detection output.
Further, step 1 specifically includes:
The input image is normalized to a unified size and fed into the designed pedestrian detection deep network. The network comprises a base network part and additional feature layers. The base network part is built from the VGG16 network up to and including its Pool5 layer, comprising 5 convolutional blocks and 5 MaxPooling operations; this part of the network extracts the base feature map from the input image. The additional feature layers comprise 8 convolutional layers and one average pooling layer. The convolution operations involved are, in order: Conv6 (3*3*1024), Conv7 (1*1*1024), Conv8_1 (1*1*256), Conv8_2 (3*3*512, stride 2), Conv9_1 (1*1*128), Conv9_2 (3*3*256, stride 2), Conv10_1 (1*1*128), Conv10_2 (3*3*256, stride 2), and Avg Pooling: Global; this part of the network extracts the multi-scale feature maps from the base feature map obtained above. The overall network structure is shown in Fig. 1.
Further, the multi-scale feature maps obtained in step 1 specifically include: the 19*19*1024 feature map from Conv7, the 10*10*512 feature map from Conv8_2, the 5*5*256 feature map from Conv9_2, the 3*3*256 feature map from Conv10_2, and the 1*1*256 feature map from Avg Pooling: Global, for a total of 5 feature maps. The sizes of the feature maps produced by these layers vary widely, which serves to detect objects at different scales.
Further, the generation at each location of the multi-scale feature maps in step 2 of a corresponding series of different fixed candidate boxes according to different size scales and aspect ratios specifically includes:
Each location of the multi-scale feature maps correspondingly generates candidate boxes of 6 size-scale and aspect-ratio combinations: one default bounding box, plus candidate boxes whose aspect ratios are 1, 2, 3, 1/2, and 1/3 respectively, as shown in Fig. 2.
Across the 5 feature maps, the candidate box size in each feature map is computed as follows.
For the above default boxes, the scale of the k-th feature map is

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$

where m is the number of feature maps, here m = 5; $s_{min}$ takes the value 0.2 and $s_{max}$ takes the value 0.95, meaning that the scale of the lowest layer is 0.2 and the scale of the topmost layer is 0.95. Each candidate box then has size

$$w_k^{a} = s_k \sqrt{a_r}, \qquad h_k^{a} = s_k / \sqrt{a_r}, \qquad a_r \in \{1, 2, 3, 1/2, 1/3\},$$

with an additional scale $s'_k = \sqrt{s_k s_{k+1}}$ for the default box of aspect ratio 1. The center of each candidate box is set as $\left(\frac{i+0.5}{|f_k|}, \frac{j+0.5}{|f_k|}\right)$, where $|f_k|$ is the size of the k-th feature map and $i, j \in [0, |f_k|)$.
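To make the box layout concrete, here is a minimal sketch of default-box generation following the formulas above. The feature-map sizes are the five listed earlier; treating the sixth box as the extra square box of scale $\sqrt{s_k s_{k+1}}$, and the boundary scale used for the last map, are assumptions.

```python
# Sketch of default-box generation; boxes are (cx, cy, w, h) in unit-normalized coordinates.
import math
import torch

def default_boxes(fmap_sizes=(19, 10, 5, 3, 1), s_min=0.2, s_max=0.95):
    m = len(fmap_sizes)
    scales = [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]
    scales.append(1.0)  # boundary scale for the extra box of the last map (assumption)
    boxes = []
    for k, f in enumerate(fmap_sizes):
        s_k = scales[k]
        s_extra = math.sqrt(s_k * scales[k + 1])        # s'_k = sqrt(s_k * s_{k+1})
        for i in range(f):
            for j in range(f):
                cx, cy = (j + 0.5) / f, (i + 0.5) / f    # box center per the formula above
                boxes.append([cx, cy, s_extra, s_extra]) # extra square default box
                for ar in (1.0, 2.0, 3.0, 0.5, 1.0 / 3.0):
                    w, h = s_k * math.sqrt(ar), s_k / math.sqrt(ar)
                    boxes.append([cx, cy, w, h])
    return torch.tensor(boxes)

priors = default_boxes()
print(priors.shape)  # torch.Size([2976, 4]): 6 boxes per location over all five maps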
Further, step 3 specifically includes:
The 3*3 convolutional classifier is applied on the multi-scale feature maps of step 1; each candidate box at each location regresses 1 pedestrian confidence score and 4 position offset values. That is, the 6 candidate boxes at each location output (1+4)*6 values in total, so a feature map of size m*n generates (1+4)*6*m*n output values.
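A minimal sketch of such a prediction head on the 19*19 Conv7 map follows; the per-channel layout of scores and offsets is an assumption, as the text does not specify it.

```python
# Sketch of the 3x3 convolutional prediction head: (1+4)*6 output channels per location.
import torch
import torch.nn as nn

num_boxes, num_outputs = 6, 1 + 4          # 1 confidence score + 4 offsets per box
head = nn.Conv2d(1024, num_boxes * num_outputs, kernel_size=3, padding=1)

fmap = torch.randn(1, 1024, 19, 19)        # e.g. the Conv7 feature map
pred = head(fmap)                          # (1, 30, 19, 19)
# Reshape to one row of [score, dx, dy, dw, dh] per candidate box (layout assumed).
pred = pred.permute(0, 2, 3, 1).reshape(1, -1, num_outputs)
print(pred.shape)                          # torch.Size([1, 2166, 5]) = 19*19*6 boxes
```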
In the training stage, the input image and the ground truth of each object are given. Any candidate box whose Jaccard overlap with any ground truth is greater than 0.5 is taken as a positive sample, so one ground truth is allowed to correspond to multiple candidate boxes. The other boxes are sorted by confidence, and 3 times the number of positive samples are chosen as negative samples.
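A minimal sketch of this matching and 3:1 hard-negative selection, using torchvision's `box_iou`; the tensor layout and the use of the predicted pedestrian score for ranking negatives are assumptions.

```python
# Sketch of training-time matching: IoU > 0.5 -> positive; top 3x negatives by confidence.
import torch
from torchvision.ops import box_iou  # expects boxes in (x1, y1, x2, y2)

def match_and_mine(priors_xyxy, gt_xyxy, scores, iou_thresh=0.5, neg_ratio=3):
    iou = box_iou(priors_xyxy, gt_xyxy)      # (num_priors, num_gt) Jaccard overlaps
    best_gt_iou, _ = iou.max(dim=1)
    pos_mask = best_gt_iou > iou_thresh      # one ground truth may match many boxes
    num_neg = int(pos_mask.sum().item()) * neg_ratio
    neg_scores = scores.clone()
    neg_scores[pos_mask] = -float('inf')     # exclude positives from negative ranking
    neg_idx = neg_scores.argsort(descending=True)[:num_neg]
    neg_mask = torch.zeros_like(pos_mask)
    neg_mask[neg_idx] = True
    return pos_mask, neg_mask
```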
Further, step 4 specifically includes:
The training loss function (objective loss function) is the weighted sum of two parts: the confidence loss and the position regression loss (localization loss). The confidence part uses Softmax Loss, and the position regression uses Smooth L1 loss. The calculation formula is

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where N is the number of candidate boxes matched with ground truth boxes; $L_{conf}$ uses Softmax Loss, whose input is the confidence c for each class; $L_{loc}$ uses Smooth L1 Loss; and α is a weight term, set to 1.
The confidence loss formula is

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \qquad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}.$$

The position regression loss formula is

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right).$$
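A minimal sketch of this two-part objective using PyTorch's built-in softmax cross-entropy and smooth L1 losses; the two-class (background/pedestrian) target encoding is an assumption.

```python
# Sketch of the weighted objective: (confidence loss + alpha * localization loss) / N.
import torch.nn.functional as F

def detection_loss(cls_logits, loc_preds, cls_targets, loc_targets,
                   pos_mask, neg_mask, alpha=1.0):
    # cls_logits: (num_boxes, 2); cls_targets: (num_boxes,) long class indices
    # loc_preds / loc_targets: (num_boxes, 4) encoded offsets
    keep = pos_mask | neg_mask                       # positives + mined negatives
    conf_loss = F.cross_entropy(cls_logits[keep], cls_targets[keep], reduction='sum')
    loc_loss = F.smooth_l1_loss(loc_preds[pos_mask], loc_targets[pos_mask],
                                reduction='sum')     # offsets only for positives
    n = pos_mask.sum().clamp(min=1).float()          # N: number of matched boxes
    return (conf_loss + alpha * loc_loss) / n
```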
Brief description of the drawings
The accompanying drawings, as a part of the invention, are provided for a further understanding of the invention; the schematic embodiments of the invention and their descriptions serve to explain the invention and do not constitute an undue limitation on it. Obviously, the drawings described below are only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort. In the drawings:
Fig. 1 is the deep network structure of the invention.
Fig. 2 shows the default bounding box and the candidate boxes with aspect ratios of 1, 2, 3, 1/2, and 1/3.
Fig. 3 shows network training data samples.
Fig. 4 shows the output of part of the feature maps, taking the Fig. 3 sample as a case.
Fig. 5 shows candidate box matching results, taking a part of the Fig. 3 sample as a case.
Fig. 6 is a case of pedestrian detection output results.
These drawings and the verbal descriptions are not intended to limit the conceptual scope of the invention in any way, but to illustrate the concept of the invention for those skilled in the art by reference to specific embodiments.
Specific embodiments
With reference to the accompanying drawings and specific embodiments, the technical problems solved by the embodiments of the present invention, the technical schemes adopted, and the technical effects achieved are described clearly and completely below. Obviously, the described embodiments are only a part of the embodiments of this application, not all of them. Based on the embodiments in this application, all other equivalent or obviously modified embodiments obtained by those of ordinary skill in the art without creative labor fall within the protection scope of the present invention. The embodiments of the present invention can be embodied in many different ways as defined and covered by the claims.
The pedestrian detection method based on a deep learning algorithm comprises 4 steps.
Step 1: the input image is normalized to a unified size and fed into the designed pedestrian detection deep network. For training, taking the INRIA pedestrian dataset as the training set, the input original image and the image sample after labeling are shown in Fig. 3(a) and Fig. 3(b) respectively. Training runs for 60K iterations in total and is updated with SGD; the learning rate is initialized to 0.001, the momentum is set to 0.9, and the weight decay is 0.0005.
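A minimal sketch of this training configuration; `model` and `data_iter` are assumed placeholders for the detection network and the INRIA-based data pipeline, and a forward pass that directly returns the training loss is an assumption.

```python
# Sketch of the stated optimization setup: SGD, lr 0.001, momentum 0.9,
# weight decay 0.0005, 60K iterations.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
for step in range(60_000):
    images, targets = next(data_iter)   # assumed INRIA-based loader
    loss = model(images, targets)       # assumed: forward pass returns the loss
    optimizer.zero_grad()
    loss.backward()                     # back-propagation
    optimizer.step()
```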
The network comprises a base network part and additional feature layers. The base network part is built from the VGG16 network up to and including its Pool5 layer, comprising 5 convolutional blocks and 5 MaxPooling operations; this part of the network extracts the base feature map from the input image. The additional feature layers comprise 8 convolutional layers and one average pooling layer. The convolution operations involved are, in order: Conv6 (3*3*1024), Conv7 (1*1*1024), Conv8_1 (1*1*256), Conv8_2 (3*3*512, stride 2), Conv9_1 (1*1*128), Conv9_2 (3*3*256, stride 2), Conv10_1 (1*1*128), Conv10_2 (3*3*256, stride 2), and Avg Pooling: Global; this part of the network extracts the multi-scale feature maps from the base feature map obtained above.
Further, the multi-scale feature maps specifically include: the 19*19*1024 feature map from Conv7, the 10*10*512 feature map from Conv8_2, the 5*5*256 feature map from Conv9_2, the 3*3*256 feature map from Conv10_2, and the 1*1*256 feature map from Avg Pooling: Global, for a total of 5 feature maps. The sizes of the feature maps produced by these layers vary widely, which serves to detect objects at different scales. Fig. 4 illustrates the output of part of the feature maps, taking the Fig. 3 sample as a case.
Step 2: each location of the multi-scale feature maps of step 1 correspondingly generates a series of different fixed candidate boxes according to different size scales and aspect ratios. Specifically, each location of the multi-scale feature maps correspondingly generates candidate boxes of 6 size-scale and aspect-ratio combinations: one default bounding box, plus candidate boxes whose aspect ratios are 1, 2, 3, 1/2, and 1/3 respectively.
The candidate box size in each feature map is computed as follows.
For the above default boxes, the scale of the k-th feature map is

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$

where m is the number of feature maps, here m = 5; $s_{min}$ takes the value 0.2 and $s_{max}$ takes the value 0.95, meaning that the scale of the lowest layer is 0.2 and the scale of the topmost layer is 0.95. Each candidate box then has size

$$w_k^{a} = s_k \sqrt{a_r}, \qquad h_k^{a} = s_k / \sqrt{a_r}, \qquad a_r \in \{1, 2, 3, 1/2, 1/3\},$$

with an additional scale $s'_k = \sqrt{s_k s_{k+1}}$ for the default box of aspect ratio 1. The center of each candidate box is set as $\left(\frac{i+0.5}{|f_k|}, \frac{j+0.5}{|f_k|}\right)$, where $|f_k|$ is the size of the k-th feature map and $i, j \in [0, |f_k|)$.
Fig. 5 shows the candidate box matching results, taking a part of the Fig. 3 sample as a case.
Step 3: the 3*3 convolutional classifier is applied on the multi-scale feature maps of step 1; each candidate box at each location regresses 1 pedestrian confidence score and 4 position offset values. That is, the 6 candidate boxes at each location output (1+4)*6 values in total, so a feature map of size m*n generates (1+4)*6*m*n output values.
In the training stage, the input image and the ground truth of each object are given. Any candidate box whose Jaccard overlap with any ground truth is greater than 0.5 is taken as a positive sample, so one ground truth is allowed to correspond to multiple candidate boxes. The other boxes are sorted by confidence, and 3 times the number of positive samples are chosen as negative samples.
Step 4: the loss is computed according to the training objective function, back-propagated, and iterated to obtain the final trained model. The training loss function is the weighted sum of the confidence loss and the position regression loss; the confidence part uses Softmax Loss and the position regression uses Smooth L1 loss:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where N is the number of candidate boxes matched with ground truth boxes; $L_{conf}$ uses Softmax Loss, whose input is the confidence c for each class; $L_{loc}$ uses Smooth L1 Loss; and α is a weight term, set to 1.
The confidence loss formula is

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \qquad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}.$$

The position regression loss formula is

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right).$$
The trained model performs pedestrian detection: an image passes through steps 1 to 3 in order, and after the confidence scores and locations of the candidate boxes are obtained, non-maximum suppression selects the final candidate boxes as the detection output. An output result case is shown in Fig. 6.
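A minimal sketch of this final selection step using torchvision's non-maximum suppression; the score and IoU thresholds are assumptions, as the text does not state them.

```python
# Sketch of detection output: confidence thresholding followed by NMS.
from torchvision.ops import nms

def detect(boxes_xyxy, scores, score_thresh=0.5, iou_thresh=0.45):
    keep = scores > score_thresh              # discard low-confidence boxes
    boxes, scores = boxes_xyxy[keep], scores[keep]
    kept = nms(boxes, scores, iou_thresh)     # indices of surviving boxes
    return boxes[kept], scores[kept]          # final candidate boxes and scores
```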
Compared with existing algorithms, the present invention effectively improves the performance and real-time behavior of pedestrian detection, reaching a detection speed above 60 frames per second (on a Titan X GPU) and an average precision of about 75%, considerably better than current state-of-the-art methods. At the same time, this design can guarantee detection accuracy when the input image has low resolution. The whole design is end-to-end, so that training and detection can be completed in a single pass.
Each step of the invention can be realized with a general computing device; for example, the steps can be concentrated on a single computing device, such as a personal computer, a server computer, a handheld or portable device, a laptop device, or a multiprocessor device, or distributed over a network of multiple computing devices. They may be executed in an order different from that shown or described herein, or fabricated as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Therefore, the present invention is not limited to any specific combination of hardware and software.
The method provided by the invention can be realized using programmable logic devices, and can also be implemented as computer program software or program modules (including routines, programs, objects, components, or data structures that perform specific tasks or implement specific abstract data types). For example, an embodiment according to the present invention can be a computer program product which, when run, causes a computer to execute the demonstrated method. The computer program product includes a computer-readable storage medium containing computer program logic or code sections for realizing the method. The computer-readable storage medium can be a built-in medium installed in a computer or a removable medium detachable from the base computer (for example, a storage device using hot-plug technology). The built-in medium includes, but is not limited to, rewritable non-volatile memory such as RAM, ROM, flash memory, and hard disks. The removable medium includes, but is not limited to: optical storage media (such as CD-ROM and DVD), magneto-optical storage media (such as MO), magnetic storage media (such as tapes or mobile hard disks), media with built-in rewritable non-volatile memory (such as memory cards), and media with built-in ROM (such as ROM cartridges).
The present invention is not limited to the embodiments described above; without departing from the substantive content of the present invention, any deformation, improvement, or replacement conceivable by those of ordinary skill in the art falls within the scope of the present invention.
Although the basic novel features of the invention as applicable to various embodiments have been shown, described, and pointed out in detail above, it will be understood that, without departing from the intent of the invention, those skilled in the art can make various omissions, substitutions, and changes to the form and details of the system.

Claims (6)

1. A pedestrian detection method based on a deep learning algorithm, characterized in that the method comprises at least:
Step 1: normalizing the input image to a unified size as input, and extracting the base feature map from the input image using the base network part of the designed deep network; then extracting multi-scale feature maps from the resulting base feature map using the additional multi-scale feature layers of the designed deep network;
Step 2: at each feature map location of the multi-scale feature maps of step 1, generating a corresponding series of different fixed candidate boxes (default boxes) according to different size scales and aspect ratios;
Step 3: applying a 3*3 convolutional classifier on the multi-scale feature maps of step 1, each candidate box at each location regressing 1 pedestrian confidence score (score) and 4 position offset values (offsets); during training, a candidate box whose Jaccard overlap with any ground truth exceeds a certain threshold is taken as a positive sample, the others are sorted by confidence, and 3 times the number of positive samples are chosen as negative samples;
Step 4: computing the loss according to the training objective function, back-propagating, and iterating to obtain the final trained model; the trained model performs pedestrian detection, wherein an image passes through steps 1 to 3 in order, and after the confidence scores and locations of the candidate boxes are obtained, non-maximum suppression selects the final candidate boxes as the detection output.
2. The method according to claim 1, characterized in that step 1 specifically comprises:
normalizing the input image to a unified size and feeding it into the designed pedestrian detection deep network; the network comprises a base network part and additional feature layers; the base network part is built from the VGG16 network up to and including its Pool5 layer, comprising 5 convolutional blocks and 5 MaxPooling operations, and extracts the base feature map from the input image; the additional feature layers comprise 8 convolutional layers and one average pooling layer, with convolution operations in order: Conv6 (3*3*1024), Conv7 (1*1*1024), Conv8_1 (1*1*256), Conv8_2 (3*3*512, stride 2), Conv9_1 (1*1*128), Conv9_2 (3*3*256, stride 2), Conv10_1 (1*1*128), Conv10_2 (3*3*256, stride 2), and Avg Pooling: Global, and extract the multi-scale feature maps from the base feature map obtained above.
3. The method according to claim 2, characterized in that the obtained multi-scale feature maps specifically comprise: the 19*19*1024 feature map from Conv7, the 10*10*512 feature map from Conv8_2, the 5*5*256 feature map from Conv9_2, the 3*3*256 feature map from Conv10_2, and the 1*1*256 feature map from Avg Pooling: Global, for a total of 5 feature maps, the sizes of which vary widely and serve to detect objects at different scales.
4. The method according to claim 1, characterized in that step 2 specifically comprises:
each location of the multi-scale feature maps correspondingly generating candidate boxes of 6 size-scale and aspect-ratio combinations: one default bounding box, plus candidate boxes whose aspect ratios are 1, 2, 3, 1/2, and 1/3 respectively; the candidate box size in each feature map being computed as follows:
for the above default boxes, the scale of the k-th feature map is

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$

where m is the number of feature maps, here m = 5; $s_{min}$ takes the value 0.2 and $s_{max}$ takes the value 0.95, meaning that the scale of the lowest layer is 0.2 and that of the topmost layer is 0.95; each candidate box then has size

$$w_k^{a} = s_k \sqrt{a_r}, \qquad h_k^{a} = s_k / \sqrt{a_r},$$

with an additional scale $s'_k = \sqrt{s_k s_{k+1}}$ for the default box; and the center of each candidate box is set as $\left(\frac{i+0.5}{|f_k|}, \frac{j+0.5}{|f_k|}\right)$, where $|f_k|$ is the size of the k-th feature map and $i, j \in [0, |f_k|)$.
5. The method according to claim 1, characterized in that step 3 specifically comprises:
applying the 3*3 convolutional classifier on the multi-scale feature maps of step 1, each candidate box at each location regressing 1 pedestrian confidence score and 4 position offset values, i.e., the 6 candidate boxes at each location outputting (1+4)*6 values in total, so that a feature map of size m*n generates (1+4)*6*m*n output values;
in the training stage, the input image and the ground truth of each object being given; any candidate box whose Jaccard overlap with any ground truth is greater than 0.5 being taken as a positive sample, so that one ground truth is allowed to correspond to multiple candidate boxes; the other boxes being sorted by confidence, with 3 times the number of positive samples chosen as negative samples.
6. The method according to claim 1, characterized in that step 4 specifically comprises:
the training loss function (objective loss function) being the weighted sum of two parts, the confidence loss and the position regression loss (localization loss), where the confidence part uses Softmax Loss and the position regression uses Smooth L1 loss:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where N is the number of candidate boxes matched with ground truth boxes, $L_{conf}$ uses Softmax Loss with the per-class confidence c as input, $L_{loc}$ uses Smooth L1 Loss, and α is a weight term set to 1;
the confidence loss formula being

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \qquad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})};$$

and the position regression loss formula being

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right).$$
CN201710531902.1A 2017-07-03 2017-07-03 Pedestrian detection method based on deep learning algorithm Pending CN109214241A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710531902.1A CN109214241A (en) 2017-07-03 2017-07-03 Pedestrian detection method based on deep learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710531902.1A CN109214241A (en) 2017-07-03 2017-07-03 Pedestrian detection method based on deep learning algorithm

Publications (1)

Publication Number Publication Date
CN109214241A (en) 2019-01-15

Family

ID=64992283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710531902.1A Pending CN109214241A (en) 2017-07-03 2017-07-03 Pedestrian detection method based on deep learning algorithm

Country Status (1)

Country Link
CN (1) CN109214241A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871829A (en) * 2019-03-15 2019-06-11 北京行易道科技有限公司 A kind of detection model training method and device based on deep learning
CN110176944A (en) * 2019-04-25 2019-08-27 中国科学院上海微系统与信息技术研究所 A kind of intelligent means for anti-jamming and method based on deep learning
CN110348390A (en) * 2019-07-12 2019-10-18 创新奇智(重庆)科技有限公司 A kind of training method, computer-readable medium and the system of fire defector model
CN111414861A (en) * 2020-03-20 2020-07-14 赛特斯信息科技股份有限公司 Method for realizing detection processing of pedestrians and non-motor vehicles based on deep learning
CN111967326A (en) * 2020-07-16 2020-11-20 北京交通大学 Gait recognition method based on lightweight multi-scale feature extraction
CN112163499A (en) * 2020-09-23 2021-01-01 电子科技大学 Small target pedestrian detection method based on fusion features
CN112560649A (en) * 2020-12-09 2021-03-26 广州云从鼎望科技有限公司 Behavior action detection method, system, equipment and medium
CN112733671A (en) * 2020-12-31 2021-04-30 新大陆数字技术股份有限公司 Pedestrian detection method, device and readable storage medium



Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190115