CN108446662A

CN108446662A - A kind of pedestrian detection method based on semantic segmentation information

Info

Publication number: CN108446662A
Application number: CN201810283404.4A
Authority: CN
Inventors: 杨昕梅; 李耀斌; 高原; 杨承; 李绍荣
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2018-04-02
Filing date: 2018-04-02
Publication date: 2018-08-24

Abstract

The invention discloses a kind of pedestrian detection method based on semantic segmentation information, it is related to the pedestrian detection method field based on neural network；It includes 1：Original RGB image input core network, corresponding semantic segmentation image in training set sample are input to branching networks, and the loss function that overall network is arranged completes training；2：The core network that original RGB image input in test set sample is completed to training carries out convolution feature extraction generation multilayer feature figure；3：The Area generation network progress pedestrian candidate frame extraction that multilayer feature figure is inputted to completion training generates pedestrian candidate region；4：It completes to export the detection result image for including pedestrian position encirclement frame after trained classification Recurrent networks are classified and positioned to pedestrian candidate region；It solves the problems, such as that existing pedestrian detection causes accuracy of detection low in low resolution pedestrian with the differentiation of background difficulty, improves the precision of pedestrian detection in the case of low resolution.

Description

A kind of pedestrian detection method based on semantic segmentation information

Technical field

It is especially a kind of based on semantic segmentation information the present invention relates to the pedestrian detection method field based on neural network Pedestrian detection method.

Background technology

Pedestrian detection technology is a most basic and common target detection technique in practical applications, it is human body behavior Analysis, Gait Recognition, intelligent video monitoring and automatic Pilot technology basis.It is emerging with convolutional neural networks in recent years It rises, huge advance is had been achieved in object detection field, but in pedestrian's detection field, still there is two big challenges：

One, it is detected compared to general target, pedestrian target and background are more difficult to distinguish.Such as in low resolution, Pedestrian target and traffic lights, column shape target have a closely similar surface characteristics, the pixel distribution of pedestrian and background Pixel distribution very close to.

Two, it how to be accurately positioned each pedestrian target, in practical applications, usually will appear intensive crowd's scene, Detector can not be accurately positioned pedestrian target to generate flase drop and missing inspection.In convolutional neural networks, convolutional layer and pond layer On the one hand high-layer semantic information is generated, the boundary of adjacent target has on the other hand also been obscured, has caused flase drop and missing inspection serious；Therefore Need a kind of pedestrian detection method that can realize the method accurately detected in low resolution.

Invention content

It is an object of the invention to：The present invention provides a kind of pedestrian detection methods based on semantic segmentation information, solve Existing pedestrian detection leads to the differentiation of background difficulty that accuracy of detection is low, is based on convolutional Neural net in low resolution pedestrian Network is divided using semantic information obscures the problem of adjacent target boundary leads to missing inspection and flase drop.

The technical solution adopted by the present invention is as follows：

A kind of pedestrian detection method based on semantic segmentation information, includes the following steps

Step 1：By core network, the corresponding semanteme point in original RGB image input overall network in training set sample The branching networks that image is input in overall network are cut, and the loss function that overall network is arranged completes training；

Step 2：Original RGB image input in test set sample is completed to core network in the overall network of training to roll up Product feature extraction generates multilayer feature figure；

Step 3：Multilayer feature figure is inputted to Area generation network in the overall network for completing training and carries out pedestrian candidate frame Extraction generates pedestrian candidate region；

Step 4：The classification Recurrent networks completed in the overall network of training are classified and are positioned to pedestrian candidate region Output includes the detection result image of pedestrian position encirclement frame afterwards.

Preferably, the step 1 includes the following steps：

Step 1.1：The branching networks of core network in overall network initialize and determine loss proportion λ_i；

Step 1.2：Original RGB image inputs to core network, that corresponding semantic segmentation image is input to branching networks is complete At foreground sample and background sample selection and generate multilayer feature figure；

Step 1.3：Semantic segmentation loss function L pixel-by-pixel is determined based on the multilayer feature figure of semantic segmentation image_ss, used Formula 1 is as follows：

Wherein, H indicates that characteristic pattern height, W indicate characteristic pattern width, p_x,yIndicate the feature at characteristic pattern position (x, y) Value, q_x,yIndicate that known corresponding supervisory signals, l (p, q) indicate cross entropy loss function, l (p, q)=- plogq- (1-p) log(1-q)；

Step 1.4：Multilayer feature figure based on original RGB image carries out the instruction of Area generation network and Recurrent networks of classifying Practice and determines corresponding loss function；

Step 1.5：Determine that total losses function completes training based on step 1.4 and step 1.3, formula 2 used is as follows：

Wherein, λ_iIndicate loss proportion,Indicate the Classification Loss function of Area generation network,Indicate region The candidate frame for generating network positions loss function,The Classification Loss function of presentation class Recurrent networks,It indicates The positioning loss function of classification Recurrent networks, L_SSIndicate semantic segmentation loss function pixel-by-pixel.

Preferably, the step 3 includes the following steps：

Step 3.1：After multilayer feature figure to be inputted to the Area generation network in the overall network for completing training, multilayer feature Figure size is W*H, and channel is the image of C, and each position based on image generates M*N candidate frame, and wherein W indicates that characteristic pattern is wide Degree, H indicate that characteristic pattern height, C indicate that image channel number, M indicate that areal array number, N indicate ratio number of combinations；

Step 3.2：Layer of classifying in selection and step 3.1 Area generation network based on foreground sample and background sample exports Belong to the probability of foreground and background in image on each position in M*N candidate frame, layer belongs to defeated from C dimensional features to classify Go out the probability i.e. confidence score of encirclement frame of foreground and background；

Step 3.3：Layer, which is returned, based on candidate frame in step 3.1 Area generation network exports M*N candidate on each position Frame corresponds to window translation zooming parameter, is exported from C dimensional features for the 4 translation contractings of refine candidate frame to which candidate frame returns layer Put parameter；

Step 3.4：Candidate frame is ranked up according to the confidence score of encirclement frame, selects highest scoring person by non-inhibited Operation obtains multiple candidate result, that is, pedestrian candidate regions.

Preferably, the step 4 includes the following steps：

Step 4.1：Pedestrian candidate region is inputted the classification Recurrent networks in the overall network for completing training to position and Classification results complete refine using corresponding 4 translation zooming parameters；

Step 4.2：Finally output includes the detection result image of pedestrian position encirclement frame.

In conclusion by adopting the above-described technical solution, the beneficial effects of the invention are as follows：

1. the present invention, which by the way that loss function is arranged in the training stage, increases, inputs semantic segmentation image progress team surveillance, It realizes and trained pedestrian's supervisory signals is increased by combination semantic segmentation information under low resolution, the information of more pedestrians is provided, Help pedestrian to be distinguished from background, it is difficult in low resolution pedestrian and background to solve existing pedestrian detection The problem for causing accuracy of detection low is distinguished, performance of the pedestrian detection under real scene is improved；

2. the present invention trains whole network, balance semantic segmentation to lose letter by core network and branching networks team surveillance The loss function accounting of number and core network, provides the supervision message of pixel scale, helps to detach adjacent target, solve Divided using semantic information based on convolutional neural networks and obscure the problem of adjacent target boundary leads to missing inspection and flase drop, improved close The shortcomings that being accurately positioned the precision of pedestrian target under crowd's scene of collection, avoiding generating flase drop and missing inspection；

3. the Lss loss functions that the present invention adds are the supervisory signals of pixel scale, finer control information is provided, one Aspect can improve the setting accuracy of conventional pedestrian, on the other hand be more easily detected the difficult sample that pedestrian's background is not easily distinguishable This, improves the precision that pedestrian target is accurately positioned under intensive crowd's scene.

Description of the drawings

Examples of the present invention will be described by way of reference to the accompanying drawings, wherein：

Fig. 1 is training and the test block diagram of the present invention；

Fig. 2 is flow chart of the method for the present invention；

Fig. 3 is the input original image of the present invention；

Fig. 4 is the semantic segmentation image of the present invention；

Fig. 5 is the pedestrian detection result output image of the present invention.

Specific implementation mode

All features disclosed in this specification or disclosed all methods or in the process the step of, in addition to mutually exclusive Feature and/or step other than, can combine in any way.

It elaborates with reference to Fig. 1-5 couples of present invention.

Embodiment 1

Embodiment 2

Step 1 includes the following steps：

Initialization is as follows：The parameter initialization of core network uses pre-training initialization mode, the initialization of branching networks Using random initializtion, preceding 60,000 iteration, learning rate 0.001, rear 20,000 iteration, using 0.0001, momentum is set learning rate It is set to 0.9, weight decaying is set as 0.0005, loss proportion λ_i1 is taken, loss proportion is determined according to different training sets.

Selection foreground sample and background sample are as follows：After demarcating positive negative sample, it is demarcated as positive sample candidate region to each, It overlaps the maximum candidate frame of ratio and is denoted as foreground sample；The remaining candidate frame of above step, if itself and some calibration weight Folded ratio is more than 0.7, is denoted as foreground sample；If the overlap proportion of itself and any one calibration is both less than 0.3, it is denoted as background sample This；Remaining candidate frame is chosen to above step to be discarded；Candidate frame more than image boundary is discarded.

Step 3 includes the following steps：

Step 3.1：After multilayer feature figure to be inputted to the Area generation network in the overall network for completing training, multilayer feature Figure size is W*H, and channel is the image of C, and each position based on image generates M*N candidate frame, and wherein W indicates that characteristic pattern is wide Degree, H indicate that characteristic pattern height, C indicate that image channel number, M indicate that areal array number, N indicate ratio number of combinations；The present embodiment M* N is 3*3, and M combinations include 128*128,256*256,512*512, and N combinations include 1：1、1：2、2：1, according to different training sets Select numerical value different；

Step 3.4：Candidate frame is ranked up according to the confidence score of encirclement frame, selects highest scoring person by non-inhibited Operation obtains multiple candidate result, that is, pedestrian candidate regions.Area generation network is a full convolutional network, and main there are two 1x1 Convolutional layer forms, and the location information of the confidence score of an output encirclement frame, an output encirclement frame refers to coordinate, coordinate i.e. 4 Translate zooming parameter；

Step 4 includes the following steps：

Effect analysis：According to Fig.5, can be precisely located pedestrian be overlapped blocking position pedestrian, the present invention by Loss function is arranged in training stage, increases input semantic segmentation image and carries out team surveillance, realizes and pass through joint under low resolution Semantic segmentation information increases trained pedestrian's supervisory signals, provides the information of more pedestrians, contributes to pedestrian the area from background It branches away, solve existing pedestrian detection causes accuracy of detection is low to ask in low resolution pedestrian with the differentiation of background difficulty Topic improves performance of the pedestrian detection under real scene, avoids missing inspection.

Claims

1. a kind of pedestrian detection method based on semantic segmentation information, it is characterised in that：Include the following steps

Step 1：By core network, the corresponding semantic segmentation figure in original RGB image input overall network in training set sample As the branching networks being input in overall network, and the loss function that overall network is arranged completes training；

Step 2：Original RGB image input in test set sample is completed to core network in the overall network of training and carries out convolution spy Sign extraction generates multilayer feature figure；

Step 3：Multilayer feature figure is inputted to Area generation network in the overall network for completing training and carries out pedestrian candidate frame extraction Generate pedestrian candidate region；

Step 4：It completes defeated after the classification Recurrent networks in the overall network of training are classified and positioned to pedestrian candidate region Go out to include the detection result image of pedestrian position encirclement frame.

2. a kind of pedestrian detection method based on semantic segmentation information according to claim 1, it is characterised in that：The step Rapid 1 includes the following steps：

Step 1.2：Before original RGB image input core network, corresponding semantic segmentation image are input to branching networks completion The selection of scape sample and background sample simultaneously generates multilayer feature figure；

Step 1.3：Semantic segmentation loss function L pixel-by-pixel is determined based on the multilayer feature figure of semantic segmentation image_ss, formula used 1 is as follows：

Wherein, H indicates that characteristic pattern height, W indicate characteristic pattern width, p_x,yIndicate the characteristic value at characteristic pattern position (x, y), q_x,yIndicate that known corresponding supervisory signals, l (p, q) indicate cross entropy loss function, l (p, q)=- plogq- (1-p) log (1-q)；

Step 1.4：Multilayer feature figure based on original RGB image carries out Area generation network and the training for Recurrent networks of classifying is true Fixed corresponding loss function；

Wherein, λ_iIndicate loss proportion,Indicate the Classification Loss function of Area generation network,Indicate Area generation The candidate frame of network positions loss function,The Classification Loss function of presentation class Recurrent networks,Presentation class The positioning loss function of Recurrent networks, L_SSIndicate semantic segmentation loss function pixel-by-pixel.

3. a kind of pedestrian detection method based on semantic segmentation information according to claim 1, it is characterised in that：The step Rapid 3 include the following steps：

Step 3.1：After multilayer feature figure to be inputted to the Area generation network in the overall network for completing training, multilayer feature figure is big Small is W*H, and channel is the image of C, and each position based on image generates M*N candidate frame, and wherein W indicates characteristic pattern width, H Indicate that characteristic pattern height, C indicate that image channel number, M indicate that areal array number, N indicate ratio number of combinations；

Step 3.2：Layer of classifying in selection and step 3.1 Area generation network based on foreground sample and background sample exports image In belong to the probability of foreground and background on each position in M*N candidate frame, to classify layer before belonging to output in C dimensional features The confidence score of probability, that is, encirclement frame of scape and background；

Step 3.3：Layer, which is returned, based on candidate frame in step 3.1 Area generation network exports M*N candidate frame pair on each position It answers window to translate zooming parameter, is exported from C dimensional features for 4 translation scaling ginsengs of refine candidate frame to which candidate frame returns layer Number；

Step 3.4：Candidate frame is ranked up according to the confidence score of encirclement frame, highest scoring person is selected to pass through non-inhibited operation Obtain multiple candidate results i.e. pedestrian candidate region.

4. a kind of pedestrian detection method based on semantic segmentation information according to claim 1, it is characterised in that：The step Rapid 4 include the following steps：

Step 4.1：Pedestrian candidate region is inputted to the classification Recurrent networks in the overall network for completing training to position and classification As a result refine is completed using corresponding 4 translation zooming parameters；