CN110490174A - Multi-scale pedestrian detection method based on feature fusion - Google Patents
- Publication number: CN110490174A
- Application number: CN201910799142.1A
- Authority: CN (China)
- Prior art keywords: pedestrian detection, layer, network, multi-scale, convolutional
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/045—Combinations of networks (under G06N3/04—Architecture, e.g. interconnection topology; G06N3/02—Neural networks; G06N3/00—Computing arrangements based on biological models; G06N—Computing arrangements based on specific computational models; G06—Computing; G—Physics)
- G06N3/08—Learning methods (under G06N3/02—Neural networks; G06N3/00—Computing arrangements based on biological models)
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands (under G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data)
- G06F2218/08—Feature extraction (under G06F2218/00—Aspects of pattern recognition specially adapted for signal processing)
Abstract
The invention discloses a multi-scale pedestrian detection method based on feature fusion, belonging to the field of pedestrian detection in computer vision. It addresses the prior-art problems that feature-fusion methods used for pedestrian detection cause low small-target detection accuracy, while multi-stage detection methods have long detection times and high resource requirements and therefore cannot reach real time. The invention comprises: preprocessing an acquired pedestrian detection dataset; constructing a multi-scale pedestrian detection convolutional neural network based on feature fusion, the network comprising a shared convolutional neural network for feature-fusion extraction and scale sub-networks for detecting the fused features; inputting the preprocessed pedestrian detection dataset into the multi-scale pedestrian detection convolutional neural network for training, obtaining the trained network; and inputting a pedestrian image to be detected into the trained multi-scale pedestrian detection convolutional neural network to obtain the final detection result.
Description
Technical field
A multi-scale pedestrian detection method based on feature fusion, used for multi-scale pedestrian detection, belonging to the field of pedestrian detection in computer vision.
Background art
Pedestrian detection is one of the important topics in computer vision and pattern recognition. It can be divided simply into two tasks: localization and classification. Localization identifies the specific position of a pedestrian in the image and produces the corresponding regression box. Classification assigns a label to the pedestrian target; since pedestrian detection involves only two classes, pedestrian and background, the classification task is comparatively easy to complete. The most important task of pedestrian detection is therefore accurately localizing the pedestrian target. Pedestrian detection has strong practical value: combined with technologies such as multi-person tracking and person re-identification, it can be applied to driverless vehicle control systems, intelligent robots, intelligent video surveillance, human behavior analysis, pedestrian flow statistics, and intelligent transportation.
Because pedestrians have the dual attributes of rigid and flexible objects, with varied postures and shapes, their apparent features are strongly affected by clothing, pose, viewing angle, and so on, and they additionally face occlusion, illumination, and other factors. This makes pedestrian target detection an extremely challenging research direction in computer vision. Commonly used pedestrian detection methods fall mainly into two classes: methods based on motion detection and methods based on machine learning. The main idea of motion-detection methods is to extract moving foreground targets via background modeling and then classify those foreground targets with a classifier to judge whether they are pedestrians. Machine-learning methods are the mainstream algorithms in the pedestrian detection field; they can be further divided into methods that add a classifier on top of handcrafted features and methods based on deep learning. However, none of these methods satisfactorily solves the multi-scale problem in pedestrian target detection, mainly because the features of large-scale and small-scale pedestrian targets differ greatly, which is difficult to resolve with prior-art means.
Current solutions are mainly feature fusion and multi-stage detection. Although feature fusion can obtain more robust features, the features of small-scale and large-scale pedestrian targets differ greatly: the body skeleton of a large pedestrian target, for example, provides rich information for human target detection, whereas a small pedestrian target lacks sufficient skeleton features. Feature-fusion methods therefore cannot solve the problem, and detection accuracy for small-scale pedestrian targets, i.e., small targets, is reduced. Multi-stage detection first obtains a preliminary detection result from one detection network and then feeds that preliminary result back into the network trunk for further detection. This is a serial approach: later detection results depend heavily on earlier ones, so the network's running speed cannot be accelerated by parallel computation. Moreover, if a picture contains multiple pedestrians, the preliminary detection yields multiple different regions that must all be sent into the network. Supposing the preliminary detection produces 20 detection boxes, the second detection stage requires 20 forward passes of the network, and further cascaded detectors require still more forward passes. This situation both increases the hardware's computational resource requirements and makes network inference take too long to reach real time. An effective solution to multi-scale pedestrian detection is therefore an urgent problem.
Summary of the invention
In view of the above problems, the purpose of the present invention is to provide a multi-scale pedestrian detection method based on feature fusion, solving the prior-art problems that feature-fusion methods used for pedestrian detection cause low small-target detection accuracy, while multi-stage detection methods have long detection times and high resource requirements and thus cannot reach real time.
In order to achieve the above object, the present invention adopts the following technical scheme:
A multi-scale pedestrian detection method based on feature fusion, comprising the following steps:
S1: preprocess the acquired pedestrian detection dataset;
S2: construct a multi-scale pedestrian detection convolutional neural network based on feature fusion, the network comprising a shared convolutional neural network for feature-fusion extraction and scale sub-networks for detecting the fused features;
S3: input the preprocessed pedestrian detection dataset into the multi-scale pedestrian detection convolutional neural network for training, obtaining the trained network;
S4: input the pedestrian image to be detected into the trained multi-scale pedestrian detection convolutional neural network to obtain the final detection result.
Further, the pedestrian detection dataset in step S1 is a training set extracted from the Caltech dataset. The Caltech dataset has 11 folders, Set00 to Set10, each containing multiple videos with resolution 640*480.
Preprocessing converts each frame image in the pedestrian detection dataset to the standard VOC data format and regenerates the corresponding annotation file in .xml format, i.e., the file suffix is .xml.
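The preprocessing step above (one VOC-style .xml annotation file per frame) can be sketched as follows. This is a minimal illustration using Python's standard library; the file name, image size, and box coordinates are invented example values, not data from the patent.

```python
import xml.etree.ElementTree as ET

def make_voc_annotation(filename, width, height, boxes):
    """Build a minimal VOC-style .xml annotation string.

    boxes: list of (label, xmin, ymin, xmax, ymax) tuples.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        bnd = ET.SubElement(obj, "bndbox")
        ET.SubElement(bnd, "xmin").text = str(xmin)
        ET.SubElement(bnd, "ymin").text = str(ymin)
        ET.SubElement(bnd, "xmax").text = str(xmax)
        ET.SubElement(bnd, "ymax").text = str(ymax)
    return ET.tostring(root, encoding="unicode")

# Example: one annotated pedestrian in a 640*480 Caltech-style frame
# (hypothetical file name and coordinates).
xml_text = make_voc_annotation("set00_v000_f0001.jpg", 640, 480,
                               [("person", 100, 120, 140, 220)])
```

In practice one such file would be written per extracted frame, with the suffix .xml as described above.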
Further, the shared convolutional neural network in step S2 comprises, in order: a first convolutional layer, a first BN layer, a first ReLU layer, a second convolutional layer, a second BN layer, a second ReLU layer, a max-pooling layer, a first dense residual module, a first max-pooling layer, a second dense residual module, a second max-pooling layer, a third dense residual module, and a third max-pooling layer. Each of the first, second, and third dense residual modules comprises three convolutional layers and a ResNet residual structure, wherein the first convolutional layer reduces the channel dimension of the input data, the second convolutional layer raises the channel dimension of the first layer's output, and the ResNet residual structure adds the input data to the second layer's output to obtain the summed result; the third convolutional layer reduces the dimension of the summed result. The outputs of the three convolutional layers are densely connected, i.e., features of different levels are stacked together, so each dense residual module outputs features containing both shallow and deep features.
Further, the output dimension of the shared convolutional neural network is 28*28*512. The kernel size of the first and second convolutional layers is 3*3, with stride 1 and SAME padding; the kernel size of the max-pooling layer and of the first, second, and third max-pooling layers is 2*2, with stride 2. Within each dense residual module, the first convolutional layer is a 1*1*(c/2) convolution, the second is 1*1*c, and the third is 1*1*(c/2), where 1*1 denotes the kernel size and c denotes the number of channels of the feature map.
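The dense residual module just described (1*1*(c/2) reduction, 1*1*c raise, ResNet shortcut add, 1*1*(c/2) reduction, then dense concatenation of the three levels' features) can be sketched numerically. This is an assumed reading of the wiring: random weights and NumPy stand in for a real convolution library, with each 1*1 convolution implemented as a per-pixel linear map over channels.

```python
import numpy as np

def conv1x1(x, w):
    """1*1 convolution: a per-pixel linear map over channels.
    x: (h, w, c_in) feature map; w: (c_in, c_out) weights."""
    return np.einsum("hwi,io->hwo", x, w)

def dense_residual_module(x, rng):
    """Sketch of the dense residual module described above."""
    h, w, c = x.shape
    w1 = rng.standard_normal((c, c // 2)) * 0.1
    w2 = rng.standard_normal((c // 2, c)) * 0.1
    w3 = rng.standard_normal((c, c // 2)) * 0.1
    y1 = conv1x1(x, w1)    # first conv: channel reduction, c -> c/2
    y2 = conv1x1(y1, w2)   # second conv: channel raise, c/2 -> c
    y2 = y2 + x            # ResNet residual shortcut: add the input
    y3 = conv1x1(y2, w3)   # third conv: reduce the summed result
    # dense connection: stack the three levels' features together,
    # so shallow (y1) and deep (y2, y3) features both appear in the output
    return np.concatenate([y1, y2, y3], axis=-1)

rng = np.random.default_rng(0)
out = dense_residual_module(rng.standard_normal((28, 28, 512)), rng)
```

With c = 512 the concatenated output carries c/2 + c + c/2 = 2c = 1024 channels, showing how shallow and deep features coexist in one module output.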
Further, in step S2, the scale sub-network comprises a backbone network and a branch network; the outputs of the backbone and branch networks are weighted to give the final detection result.
The backbone network comprises a large-scale sub-network for detecting large-scale targets and a small-scale sub-network for detecting small-scale targets. Each comprises, in order: a first convolutional layer, first BN layer, first ReLU layer, second convolutional layer, second BN layer, second ReLU layer, first max-pooling layer, third convolutional layer, third BN layer, third ReLU layer, fourth convolutional layer, fourth BN layer, fourth ReLU layer, second max-pooling layer, dense residual module, fifth convolutional layer, fifth BN layer, fifth ReLU layer, and loss function.
The branch network comprises a scale-aware weighting layer that assigns weights to the outputs of the large-scale and small-scale sub-networks according to the height information in the shared convolutional neural network's output. The weight computation formula in the scale-aware weighting layer is:
where ω_l is the weight of the large-scale sub-network, ω_s is the weight of the small-scale sub-network, h̄ is the average height of pedestrian targets, α and β are proportionality coefficients optimized by backpropagation, and h is the height of a given pedestrian target.
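The weight formula itself is not reproduced in this text, so the following sketch assumes a common sigmoid-style scale-aware form: the large-scale weight ω_l rises as the target height h exceeds the average height h̄, and ω_s is taken as 1 - ω_l. Both the functional form and the complementary-weight assumption are illustrative, not the patent's actual formula.

```python
import math

def scale_weights(h, h_bar, alpha=1.0, beta=10.0):
    """Assumed sigmoid-style scale-aware weighting: taller-than-average
    targets weight the large-scale sub-network more heavily."""
    w_l = 1.0 / (1.0 + alpha * math.exp(-(h - h_bar) / beta))
    w_s = 1.0 - w_l   # assumed complementary small-scale weight
    return w_l, w_s

def fuse(score_large, score_small, h, h_bar):
    """Weighted combination of the two sub-networks' detection scores."""
    w_l, w_s = scale_weights(h, h_bar)
    return w_l * score_large + w_s * score_small
```

Here alpha and beta play the role of the proportionality coefficients α and β, which in the method are optimized by backpropagation rather than fixed as above.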
Further, the kernel size of the first, second, third, and fourth convolutional layers in the backbone network is 3*3; the kernel size of the fifth convolutional layer is 3*3, with stride 1 and SAME padding. The kernel size of the first and second max-pooling layers is 2*2, with stride 2.
Further, the detection framework at the output of the fifth convolutional layer of the large-scale and small-scale sub-networks uses the YOLO algorithm. The anchors in the YOLO algorithm are obtained by k-means clustering of the pedestrian bbox height-to-width-ratio features in the pedestrian detection dataset; the anchor area is set to 7*7, and at this area size the chosen height-to-width ratios are {3:1, 5:2, 5:3}, where bbox denotes the annotation box.
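The anchor-clustering idea above (k-means over annotated pedestrian height-to-width ratios) can be sketched in one dimension as follows. The box list is invented example data, and this simplified stand-in clusters raw ratios with squared-distance assignment rather than the IoU-based distance some YOLO variants use.

```python
import random

def kmeans_1d(values, k, iters=50, seed=0):
    """Plain 1-D k-means over bbox height/width ratios, a simplified
    stand-in for the anchor clustering described in the text."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(iters):
        # assign each ratio to its nearest center
        clusters = [[] for _ in range(k)]
        for v in values:
            i = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[i].append(v)
        # recompute centers; keep the old center if a cluster empties
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Hypothetical annotated pedestrian boxes as (width, height) pairs.
boxes = [(40, 120), (42, 118), (30, 75), (28, 72), (36, 60), (35, 58)]
ratios = [h / w for w, h in boxes]
anchors = kmeans_1d(ratios, 3)
```

The three cluster centers play the role of the chosen anchor ratios; on real Caltech annotations the clustering would be run over the full training set.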
Further, the loss function is the weighted sum of a classification cross-entropy loss and a localization Smooth L1 loss. Stochastic gradient descent is used as the optimization method, the initial learning rate is set to 0.001, and training terminates when the loss no longer decreases.
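A minimal sketch of such a weighted loss, assuming a scalar pedestrian probability per box and coordinate-wise Smooth L1 on the box regression; the weighting coefficients w_cls and w_loc are illustrative, since their values are not stated in the text.

```python
import math

def smooth_l1(x):
    """Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def detection_loss(p_pedestrian, is_pedestrian, box_pred, box_gt,
                   w_cls=1.0, w_loc=1.0):
    """Weighted sum of classification cross-entropy and localization
    Smooth L1, as described above (two classes: pedestrian/background)."""
    eps = 1e-12  # numerical guard for log
    ce = -(is_pedestrian * math.log(p_pedestrian + eps)
           + (1 - is_pedestrian) * math.log(1 - p_pedestrian + eps))
    loc = sum(smooth_l1(p - g) for p, g in zip(box_pred, box_gt))
    return w_cls * ce + w_loc * loc
```

In training, this scalar would be minimized by stochastic gradient descent starting from learning rate 0.001, stopping once it no longer decreases.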
Further, in step S3, the parameters of a shared convolutional neural network pretrained on the ImageNet dataset are used as the initial parameters of the shared convolutional neural network, while the scale sub-networks use distribution-initialized parameters, i.e., the common deep-learning initialization. During training of the multi-scale pedestrian detection convolutional neural network, backpropagation is performed by gradient descent to update the parameters.
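The initialization and update scheme of step S3 can be sketched as follows. The parameter layout (flat lists keyed by name) and the small-Gaussian stand-in for "distribution initialization" are assumptions for illustration only.

```python
import random

def init_parameters(pretrained_shared, scale_subnet_sizes, rng):
    """Shared-network parameters are copied from a pretrained model;
    scale sub-network parameters are drawn from a distribution
    (a small Gaussian here, standing in for the usual init)."""
    params = dict(pretrained_shared)  # copy ImageNet-pretrained weights
    for name, n in scale_subnet_sizes.items():
        params[name] = [0.01 * rng.gauss(0, 1) for _ in range(n)]
    return params

def sgd_step(params, grads, lr=0.001):
    """One gradient-descent update: p <- p - lr * grad."""
    return {k: [p - lr * g for p, g in zip(params[k], grads[k])]
            for k in params}

# Hypothetical tiny parameter sets for illustration.
rng = random.Random(0)
params = init_parameters({"shared_w": [1.0, 2.0]}, {"scale_w": 3}, rng)
```

Gradients would come from backpropagating the detection loss; the learning rate 0.001 matches the initial rate stated above.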
Compared with the prior art, the advantages of the present invention are as follows:
One, the shared convolutional neural network based on feature fusion makes full use of the shallow and deep features of the convolutional network: shallow features have high resolution, which benefits accurate localization, while deep features are rich in semantic information, which benefits correct classification.
Two, the shared convolutional neural network based on feature fusion is combined with the scale sub-networks, solving the problem of differing target scales in pedestrian detection; the small-scale and large-scale sub-networks are fused into a unified framework, providing an end-to-end solution and finally obtaining good pedestrian detection results.
Three, the multi-scale pedestrian detection framework based on feature fusion is built on YOLO. Bounding boxes suited to pedestrian detection are designed for the height-to-width characteristics of pedestrians, i.e., the annotated data are clustered with the k-means idea to obtain the pedestrian height-to-width-ratio distribution, making the pedestrian detection results more accurate.
Brief description of the drawings
Fig. 1 is a flow diagram of the present invention;
Fig. 2 is a structural schematic diagram of the dense residual module in the present invention, where w is the feature-map width, h is the feature-map height, and c is the number of feature-map channels;
Fig. 3 is a structural schematic diagram of the multi-scale pedestrian detection convolutional neural network in the present invention, where conv denotes a convolutional layer, stride denotes the stride, pool denotes a max-pooling layer, DRM denotes a dense residual module, weight1 denotes the weight of the large-scale network, and weight2 denotes the weight of the small-scale network;
Fig. 4 is a schematic diagram of the multi-scale pedestrian detection framework based on feature fusion in the present invention;
Fig. 5 is a schematic diagram of the detection results of the prior-art YOLO detection method in the embodiment;
Fig. 6 is a schematic diagram of the detection results of the multi-scale pedestrian detection convolutional neural network detection method in the embodiment.
Specific embodiment
The present invention is further described below in conjunction with the drawings and specific embodiments.
A multi-scale pedestrian detection method based on feature fusion, comprising the following steps:
S1: preprocess the acquired pedestrian detection dataset. The pedestrian detection dataset is a training set extracted from the Caltech dataset; the Caltech dataset has 11 folders, Set00 to Set10, each containing multiple videos with resolution 640*480.
Preprocessing converts each frame image in the pedestrian detection dataset to the standard VOC data format and regenerates the corresponding annotation file in .xml format, i.e., the file suffix is .xml.
S2: construct a multi-scale pedestrian detection convolutional neural network based on feature fusion, the network comprising a shared convolutional neural network for feature-fusion extraction and scale sub-networks for detecting the fused features. The shared convolutional neural network comprises, in order: a first convolutional layer, first BN layer, first ReLU layer, second convolutional layer, second BN layer, second ReLU layer, max-pooling layer, first dense residual module, first max-pooling layer, second dense residual module, second max-pooling layer, third dense residual module, and third max-pooling layer. Each of the first, second, and third dense residual modules comprises three convolutional layers and a ResNet residual structure, wherein the first convolutional layer reduces the channel dimension of the input data, the second convolutional layer raises the channel dimension of the first layer's output, and the ResNet residual structure adds the input data to the second layer's output through a shortcut connection to obtain the summed result; the third convolutional layer reduces the dimension of the summed result. The outputs of the three convolutional layers are densely connected, i.e., features of different levels are stacked together, so each dense residual module outputs features containing both shallow and deep features. The output dimension of the shared convolutional neural network is 28*28*512; the kernel size of the first and second convolutional layers is 3*3, with 64 channels, stride 1, and SAME padding; the kernel size of the max-pooling layer and of the first, second, and third max-pooling layers is 2*2, with stride 2. Within each dense residual module, the first convolutional layer is a 1*1*(c/2) convolution, the second is 1*1*c, and the third is 1*1*(c/2), where 1*1 denotes the kernel size and c denotes the number of channels of the feature map.
The scale sub-network comprises a backbone network and a branch network; the outputs of the backbone and branch networks are weighted to give the final detection result.
The backbone network comprises a large-scale sub-network for detecting large-scale targets and a small-scale sub-network for detecting small-scale targets. Each comprises, in order: a first convolutional layer, first BN layer, first ReLU layer, second convolutional layer, second BN layer, second ReLU layer, first max-pooling layer, third convolutional layer, third BN layer, third ReLU layer, fourth convolutional layer, fourth BN layer, fourth ReLU layer, second max-pooling layer, dense residual module, fifth convolutional layer, fifth BN layer, fifth ReLU layer, and loss function.
The branch network comprises a scale-aware weighting layer that assigns weights to the outputs of the large-scale and small-scale sub-networks according to the height information in the shared convolutional neural network's output. The weight computation formula in the scale-aware weighting layer is:
where ω_l is the weight of the large-scale sub-network, ω_s is the weight of the small-scale sub-network, h̄ is the average height of pedestrian targets, α and β are proportionality coefficients optimized by backpropagation, and h is the height of a given pedestrian target.
The kernel size of the first, second, third, and fourth convolutional layers in the backbone network is 3*3, with 512 channels; the kernel size of the fifth convolutional layer is 3*3, with 12 channels, stride 1, and SAME padding. The kernel size of the first and second max-pooling layers is 2*2, with stride 2.
The detection framework at the output of the fifth convolutional layer of the large-scale and small-scale sub-networks uses the YOLO algorithm. The anchors in the YOLO algorithm are obtained by k-means clustering of the pedestrian bbox height-to-width-ratio features in the pedestrian detection dataset; the anchor area is set to 7*7, and at this area size the chosen height-to-width ratios are {3:1, 5:2, 5:3}, where bbox denotes the annotation box.
The loss function is the weighted sum of a classification cross-entropy loss and a localization Smooth L1 loss. Stochastic gradient descent is used as the optimization method, the initial learning rate is set to 0.001, and training terminates when the loss no longer decreases.
S3: input the preprocessed pedestrian detection dataset into the multi-scale pedestrian detection convolutional neural network for training (the training method is an existing approach), obtaining the trained network. Specifically, the parameters of a shared convolutional neural network pretrained on the ImageNet dataset are used as the initial parameters of the shared convolutional neural network, while the scale sub-networks use distribution-initialized parameters, i.e., the common deep-learning initialization. During training of the multi-scale pedestrian detection convolutional neural network, backpropagation is performed by stochastic gradient descent to update the parameters.
S4: input the pedestrian image to be detected into the trained multi-scale pedestrian detection convolutional neural network to obtain the final detection result.
Embodiment
A test set is extracted from the Caltech dataset. The 448*448*3 pedestrian images to be detected in the test set are input into the trained multi-scale pedestrian detection convolutional neural network, giving the final detection results shown in Fig. 6.
The same 448*448*3 test images are detected with the existing YOLO method; the detection results obtained are shown in Fig. 5.
The experiments compare the detection accuracy (mAP) and the number of video frames processed per second (FPS) of the prior-art YOLO detection method, the multi-scale pedestrian detection convolutional neural network detection method of the present invention, and the multi-scale pedestrian detection convolutional neural network + YOLO algorithm of the present invention. The comparison table is as follows:
It can clearly be seen that using YOLO alone gives very poor detection accuracy, that is, very poor accuracy for small targets. This is mainly because the Caltech dataset has a very large proportion of small-scale pedestrians and the annotated data contain occluded targets, which increases the difficulty of detection. The present invention solves the problem of differing scales in pedestrian detection very well and therefore performs well on this dataset. Meanwhile, combining the multi-scale pedestrian detection convolutional neural network with the YOLO algorithm, i.e., detecting not only on the 7*7 feature map but, in the last experiment, on the two scales {7*7, 14*14}, improves the final detection results by about 2%. In summary, the present invention significantly improves detection accuracy at only a small reduction in FPS.
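The two-scale detection of the last experiment can be illustrated by where detections live on each feature map: a 7*7 grid gives coarse cells suited to large pedestrians, a 14*14 grid finer cells suited to small ones, and their detections are pooled. This sketch only computes cell centers for a 448*448 input; full YOLO decoding (offsets, anchors, non-maximum suppression) is omitted.

```python
def grid_cell_centers(grid, img_size=448):
    """Centers of the cells of a grid*grid YOLO-style feature map,
    in image pixel coordinates."""
    step = img_size / grid
    return [((i + 0.5) * step, (j + 0.5) * step)
            for j in range(grid) for i in range(grid)]

coarse = grid_cell_centers(7)    # 49 coarse cells, 64-pixel spacing
fine = grid_cell_centers(14)     # 196 fine cells, 32-pixel spacing
all_cells = coarse + fine        # detections from both scales pooled
```

Detecting on both grids roughly quintuples the number of candidate cells, which is consistent with a modest FPS cost for an accuracy gain.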
The above is only a representative embodiment among the numerous concrete applications of the present invention and does not constitute any restriction on its protection scope. All technical solutions formed by transformation or equivalent replacement fall within the protection scope of the present invention.
Claims (9)
1. A multi-scale pedestrian detection method based on feature fusion, characterized by comprising the following steps:
S1: preprocess the acquired pedestrian detection dataset;
S2: construct a multi-scale pedestrian detection convolutional neural network based on feature fusion, the network comprising a shared convolutional neural network for feature-fusion extraction and scale sub-networks for detecting the fused features;
S3: input the preprocessed pedestrian detection dataset into the multi-scale pedestrian detection convolutional neural network for training, obtaining the trained network;
S4: input the pedestrian image to be detected into the trained multi-scale pedestrian detection convolutional neural network to obtain the final detection result.
2. The multi-scale pedestrian detection method based on feature fusion according to claim 1, characterized in that the pedestrian detection dataset in step S1 is a training set extracted from the Caltech dataset; the Caltech dataset has 11 folders, Set00 to Set10, each containing multiple videos with resolution 640*480;
preprocessing converts each frame image in the pedestrian detection dataset to the standard VOC data format and regenerates the corresponding annotation file in .xml format, i.e., the file suffix is .xml.
3. The multi-scale pedestrian detection method based on feature fusion according to claim 1 or 2, characterized in that the shared convolutional neural network in step S2 comprises, in order: a first convolutional layer, a first BN layer, a first ReLU layer, a second convolutional layer, a second BN layer, a second ReLU layer, a max-pooling layer, a first dense residual module, a first max-pooling layer, a second dense residual module, a second max-pooling layer, a third dense residual module, and a third max-pooling layer; each of the first, second, and third dense residual modules comprises three convolutional layers and a ResNet residual structure, wherein the first convolutional layer reduces the channel dimension of the input data, the second convolutional layer raises the channel dimension of the first layer's output, and the ResNet residual structure adds the input data to the second layer's output to obtain the summed result; the third convolutional layer reduces the dimension of the summed result; the outputs of the three convolutional layers are densely connected, i.e., features of different levels are stacked together, so each dense residual module outputs features containing both shallow and deep features.
4. The multi-scale pedestrian detection method based on feature fusion according to claim 3, characterized in that the output dimension of the shared convolutional neural network is 28*28*512; the kernel size of the first and second convolutional layers is 3*3, with stride 1 and SAME padding; the kernel size of the max-pooling layer and of the first, second, and third max-pooling layers is 2*2, with stride 2; within each dense residual module, the first convolutional layer is a 1*1*(c/2) convolution, the second is 1*1*c, and the third is 1*1*(c/2), where 1*1 denotes the kernel size and c denotes the number of channels of the feature map.
5. The multi-scale pedestrian detection method based on feature fusion according to claim 1 or 2, characterized in that, in step S2, the scale sub-network comprises a backbone network and a branch network, and the outputs of the backbone and branch networks are weighted to give the final detection result;
the backbone network comprises a large-scale sub-network for detecting large-scale targets and a small-scale sub-network for detecting small-scale targets, each comprising, in order: a first convolutional layer, first BN layer, first ReLU layer, second convolutional layer, second BN layer, second ReLU layer, first max-pooling layer, third convolutional layer, third BN layer, third ReLU layer, fourth convolutional layer, fourth BN layer, fourth ReLU layer, second max-pooling layer, dense residual module, fifth convolutional layer, fifth BN layer, fifth ReLU layer, and loss function;
the branch network comprises a scale-aware weighting layer that assigns weights to the outputs of the large-scale and small-scale sub-networks according to the height information in the shared convolutional neural network's output; the weight computation formula in the scale-aware weighting layer is:
where ω_l is the weight of the large-scale sub-network, ω_s is the weight of the small-scale sub-network, h̄ is the average height of pedestrian targets, α and β are proportionality coefficients optimized by backpropagation, and h is the height of a given pedestrian target.
6. The multi-scale pedestrian detection method based on feature fusion according to claim 5, wherein
the kernel sizes of the first, second, third, and fourth convolutional layers in the core network are 3*3; the kernel size of the fifth convolutional layer is 3*3, with a stride of 1 and SAME padding; the kernel sizes of the first and second max-pooling layers are 2*2, with a stride of 2.
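These layer parameters determine the spatial resolution at each stage: 3*3 stride-1 SAME convolutions preserve the feature-map size, and each 2*2 stride-2 max pool halves it. A small sketch (with a hypothetical 224-pixel input, not stated in the patent) tracks this:

```python
def spatial_size(size, stack):
    # Track the feature-map side length through the claim-6 stack:
    # 3*3 stride-1 SAME convolutions preserve the size, and
    # 2*2 stride-2 max pools halve it.
    for layer in stack:
        if layer == "pool":
            size //= 2
        # "conv" layers (3*3, stride 1, SAME) leave the size unchanged
    return size

# Layer order from claims 5-6 (BN/ReLU layers omitted, as they do not
# change spatial size): conv, conv, pool, conv, conv, pool, conv
CORE_STACK = ["conv", "conv", "pool", "conv", "conv", "pool", "conv"]
```

For example, `spatial_size(224, CORE_STACK)` returns 56: the two pooling layers reduce the input by a factor of 4.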
7. The multi-scale pedestrian detection method based on feature fusion according to claim 6, wherein
the outputs of the fifth convolutional layers of the large-scale sub-network and the small-scale sub-network use the YOLO algorithm as the detection framework; the anchors in the YOLO algorithm are obtained by k-means clustering of the height-width ratios of the pedestrian bboxes in the pedestrian detection data set; the anchor area is set to 7*7, and under this area size the chosen height-width ratios are {3:1, 5:2, 5:3}, where bbox denotes the annotation box.
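Once the k-means clustering has produced the height-width ratios, each anchor's width and height follow from the fixed area: with w*h = area and h/w = r, one gets w = sqrt(area/r) and h = w*r. A sketch of that conversion (the clustering step itself is omitted):

```python
import math

def anchors_from_ratios(ratios, area=49.0):
    # Claim 7: the anchor area is 7*7 (= 49). For each height:width
    # ratio r, solve w*h = area and h/w = r, giving w = sqrt(area/r)
    # and h = w*r.
    boxes = []
    for num, den in ratios:
        r = num / den
        w = math.sqrt(area / r)
        boxes.append((w, w * r))
    return boxes

# The three clustered ratios chosen in the claim: 3:1, 5:2, 5:3
ANCHORS = anchors_from_ratios([(3, 1), (5, 2), (5, 3)])
```

Every resulting box has area 49 while its height-width ratio matches the cluster, so the anchors stay tall and narrow, as pedestrians tend to be.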
8. The multi-scale pedestrian detection method based on feature fusion according to claim 7, wherein
the loss function is the weighted sum of a classification cross-entropy loss and a localization Smooth L1 loss; stochastic gradient descent is used as the optimization method, the initial learning rate is set to 0.001, and training terminates when the loss no longer decreases.
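The claim does not spell the two terms out; a minimal sketch of the stated combination, assuming the standard Smooth L1 definition and a hypothetical balance weight `lam` (the claim only says "weighted sum"):

```python
import math

def smooth_l1(x):
    # Localization term of claim 8: quadratic near zero, linear beyond 1,
    # which keeps gradients bounded for large regression errors.
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def detection_loss(p_true, residuals, lam=1.0):
    # Weighted sum of the classification cross-entropy (-log of the
    # probability assigned to the true class) and the Smooth L1 over the
    # box-regression residuals. lam is a hypothetical balance weight.
    cls_loss = -math.log(max(p_true, 1e-12))
    loc_loss = sum(smooth_l1(r) for r in residuals)
    return cls_loss + lam * loc_loss
```

A perfect prediction (true-class probability 1 and zero residuals) yields zero loss, and either term alone drives the total upward.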
9. The multi-scale pedestrian detection method based on feature fusion according to claim 1, wherein
in step S3, the parameters of a shared convolutional neural network pre-trained on the ImageNet data set are used as the initial parameters of the shared convolutional neural network, and the initial parameters of the scale sub-network are drawn from a distribution, i.e., the common deep-learning initialization; during training of the multi-scale pedestrian detection convolutional neural network, backpropagation is performed by stochastic gradient descent to update the parameters.
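The two initialization paths and the update rule in claims 8-9 can be sketched as follows; the Gaussian standard deviation is an assumption (the claim only says "a distribution"), while the 0.001 learning rate comes from claim 8:

```python
import random

def init_params(n, pretrained=None, std=0.01):
    # Claim 9: the shared backbone reuses ImageNet-pretrained weights
    # when available; the scale sub-networks instead draw fresh values
    # from a distribution (here a zero-mean Gaussian, a common
    # deep-learning initialization).
    if pretrained is not None:
        return list(pretrained)
    return [random.gauss(0.0, std) for _ in range(n)]

def sgd_step(params, grads, lr=0.001):
    # One stochastic-gradient-descent update applied to backpropagated
    # gradients, using the initial learning rate stated in claim 8.
    return [p - lr * g for p, g in zip(params, grads)]
```

Training repeats `sgd_step` over mini-batches until the loss stops decreasing, per the claim's termination condition.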
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910799142.1A CN110490174A (en) | 2019-08-27 | 2019-08-27 | Multiple dimensioned pedestrian detection method based on Fusion Features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110490174A true CN110490174A (en) | 2019-11-22 |
Family
ID=68554575
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910799142.1A Pending CN110490174A (en) | 2019-08-27 | 2019-08-27 | Multiple dimensioned pedestrian detection method based on Fusion Features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110490174A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180211130A1 (en) * | 2015-07-29 | 2018-07-26 | Nokia Technologies Oy | Object detection with neural network |
US20180307921A1 (en) * | 2017-04-25 | 2018-10-25 | Uber Technologies, Inc. | Image-Based Pedestrian Detection |
CN109033979A (en) * | 2018-06-29 | 2018-12-18 | 西北工业大学 | Indoor pedestrian detection method based on WIFI and camera sensor decision level fusion |
CN109102025A (en) * | 2018-08-15 | 2018-12-28 | 电子科技大学 | Pedestrian based on deep learning combined optimization recognition methods again |
Non-Patent Citations (2)
Title |
---|
JIANAN LI et al.: "Scale-Aware Fast R-CNN for Pedestrian Detection", IEEE Transactions on Multimedia * |
WANG Qiang: "Design and Implementation of a Pedestrian Detection System in Intelligent Video Surveillance", China Master's Theses Full-text Database, Information Science and Technology Series * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111259736B (en) * | 2020-01-08 | 2023-04-07 | 上海海事大学 | Real-time pedestrian detection method based on deep learning in complex environment |
CN111259736A (en) * | 2020-01-08 | 2020-06-09 | 上海海事大学 | Real-time pedestrian detection method based on deep learning in complex environment |
CN113128316A (en) * | 2020-01-15 | 2021-07-16 | 北京四维图新科技股份有限公司 | Target detection method and device |
CN111488918A (en) * | 2020-03-20 | 2020-08-04 | 天津大学 | Transformer substation infrared image equipment detection method based on convolutional neural network |
CN111582092B (en) * | 2020-04-27 | 2023-12-22 | 西安交通大学 | Pedestrian abnormal behavior detection method based on human skeleton |
CN111582092A (en) * | 2020-04-27 | 2020-08-25 | 西安交通大学 | Pedestrian abnormal behavior detection method based on human skeleton |
CN113516140A (en) * | 2020-05-07 | 2021-10-19 | 阿里巴巴集团控股有限公司 | Image processing method, model training method, system and equipment |
CN111986255A (en) * | 2020-09-07 | 2020-11-24 | 北京凌云光技术集团有限责任公司 | Multi-scale anchor initialization method and device of image detection model |
CN111986255B (en) * | 2020-09-07 | 2024-04-09 | 凌云光技术股份有限公司 | Multi-scale anchor initializing method and device of image detection model |
CN112164034A (en) * | 2020-09-15 | 2021-01-01 | 郑州金惠计算机系统工程有限公司 | Workpiece surface defect detection method and device, electronic equipment and storage medium |
CN112164034B (en) * | 2020-09-15 | 2023-04-28 | 郑州金惠计算机系统工程有限公司 | Workpiece surface defect detection method and device, electronic equipment and storage medium |
CN112016527A (en) * | 2020-10-19 | 2020-12-01 | 成都大熊猫繁育研究基地 | Panda behavior recognition method, system, terminal and medium based on deep learning |
CN112990073A (en) * | 2021-03-31 | 2021-06-18 | 南京农业大学 | Suckling period piglet activity rule statistical system based on edge calculation |
CN113269038A (en) * | 2021-04-19 | 2021-08-17 | 南京邮电大学 | Multi-scale-based pedestrian detection method |
CN113269038B (en) * | 2021-04-19 | 2022-07-15 | 南京邮电大学 | Multi-scale-based pedestrian detection method |
CN113139979A (en) * | 2021-04-21 | 2021-07-20 | 广州大学 | Edge identification method based on deep learning |
CN113505640A (en) * | 2021-05-31 | 2021-10-15 | 东南大学 | Small-scale pedestrian detection method based on multi-scale feature fusion |
CN113850284A (en) * | 2021-07-04 | 2021-12-28 | 天津大学 | Multi-operation detection method based on multi-scale feature fusion and multi-branch prediction |
CN114652326A (en) * | 2022-01-30 | 2022-06-24 | 天津大学 | Real-time brain fatigue monitoring device based on deep learning and data processing method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110490174A (en) | Multiple dimensioned pedestrian detection method based on Fusion Features | |
Dewi et al. | Synthetic Data generation using DCGAN for improved traffic sign recognition | |
CN109800689B (en) | Target tracking method based on space-time feature fusion learning | |
US11494938B2 (en) | Multi-person pose estimation using skeleton prediction | |
CN104599275B (en) | The RGB-D scene understanding methods of imparametrization based on probability graph model | |
CN106845499A (en) | A kind of image object detection method semantic based on natural language | |
CN104281853A (en) | Behavior identification method based on 3D convolution neural network | |
Gong et al. | Object detection based on improved YOLOv3-tiny | |
CN108764019A (en) | A kind of Video Events detection method based on multi-source deep learning | |
CN110555387A (en) | Behavior identification method based on local joint point track space-time volume in skeleton sequence | |
CN110046677B (en) | Data preprocessing method, map construction method, loop detection method and system | |
CN110232361A (en) | Human body behavior intension recognizing method and system based on the dense network of three-dimensional residual error | |
CN114821640A (en) | Skeleton action identification method based on multi-stream multi-scale expansion space-time diagram convolution network | |
CN110197121A (en) | Moving target detecting method, moving object detection module and monitoring system based on DirectShow | |
CN114689038A (en) | Fruit detection positioning and orchard map construction method based on machine vision | |
Shah et al. | Detection of different types of blood cells: A comparative analysis | |
Chu et al. | Target tracking via particle filter and convolutional network | |
CN117423032A (en) | Time sequence dividing method for human body action with space-time fine granularity, electronic equipment and computer readable storage medium | |
CN117079095A (en) | Deep learning-based high-altitude parabolic detection method, system, medium and equipment | |
CN110826575A (en) | Underwater target identification method based on machine learning | |
Xie et al. | Automatic parking space detection system based on improved YOLO algorithm | |
CN115909086A (en) | SAR target detection and identification method based on multistage enhanced network | |
CN112926681B (en) | Target detection method and device based on deep convolutional neural network | |
Qin et al. | Joint prediction and association for deep feature multiple object tracking | |
Putro et al. | Fast person detector with efficient multi-level contextual block for supporting assistive robot |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191122 |