CN110781744A - Small-scale pedestrian detection method based on multi-level feature fusion - Google Patents
Info
- Publication number
- CN110781744A CN110781744A CN201910899189.5A CN201910899189A CN110781744A CN 110781744 A CN110781744 A CN 110781744A CN 201910899189 A CN201910899189 A CN 201910899189A CN 110781744 A CN110781744 A CN 110781744A
- Authority
- CN
- China
- Prior art keywords
- pedestrian
- feature
- network
- features
- feature map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
- G06V40/25—Recognition of walking or running movements, e.g. gait recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Evolutionary Biology (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
- Traffic Control Systems (AREA)
Abstract
The invention discloses a small-scale pedestrian detection method based on multi-level feature fusion. The method first extracts features from an input image with a convolutional neural network and fuses shallow features, which are rich in detail, with deep features, which are rich in semantics. The fused multi-level features are then shared, via pedestrian-candidate-region pooling, between the pedestrian region proposal stage and the classification-and-regression stage, which produces the final detection result. A designed loss function joins feature extraction, region proposal, and classification-and-regression into one complete network, so the whole network can be trained end to end and the trained detector can be applied directly to pedestrian detection. The method remains robust and efficient in autonomous-driving and surveillance scenes containing many small-scale pedestrians.
Description
Technical Field
The invention belongs to the field of object detection in computer vision, and in particular relates to a pedestrian detection method based on multi-level feature fusion.
Background
As an instance of object detection, pedestrian detection has long been a hot and difficult topic in computer vision research. With the rapid development of computer vision and artificial intelligence over recent decades, it has become a popular research subject in many fields, including social security, public transportation, and Internet applications. Pedestrian detection aims to determine whether an image or video contains pedestrians and, if so, to find the positions and sizes of all of them. The technology is the basis and premise of research such as pedestrian tracking, gait recognition, and pedestrian behavior analysis, and a good pedestrian detection algorithm provides strong support and assurance for these downstream tasks.
Pedestrians are among the hardest objects to detect: their backgrounds are complex and changeable, and occlusion and small scale are more severe problems than in general object detection. Small-scale pedestrians affect detection performance the most and are also common in practice. In an autonomous-driving scene, for example, many small pedestrians appear far from the vehicle; such targets are difficult to detect with lidar, so computer vision must assist. A detection method aimed at small-scale pedestrians is therefore needed.
Disclosure of Invention
The invention aims to provide a small-scale pedestrian detection method based on multi-level feature fusion for scenes containing many small-scale pedestrians, such as autonomous driving and video surveillance, so as to improve the accuracy of pedestrian detection. The method is a complete deep-learning detection network: a convolutional neural network first extracts features from the input image and fuses shallow features rich in detail with deep features rich in semantics; the fused multi-level features are then shared, via pedestrian-candidate-region pooling, between the pedestrian region proposal stage and the classification-and-regression stage; finally, the detection result is obtained. A designed loss function joins feature extraction, region proposal, and classification-and-regression into one complete network, so the whole network is trained end to end and the trained detector can be applied directly to pedestrian detection.
The invention specifically comprises the following steps:
step 1, inputting the training-set images into a convolutional neural network to extract features; generating pedestrian candidate regions with a pedestrian region proposal network from the extracted features; mapping the candidate regions back onto the feature map extracted by the convolutional neural network and pooling them into feature vectors of the same size; sending the feature vectors into a classification-and-regression network for classification and regression; and finally setting a loss function for the overall network to complete end-to-end training;
step 2, inputting the test-set images into the convolutional neural network for feature extraction to obtain multi-layer convolutional feature maps, and fusing the shallow features rich in detail with the deep features rich in semantics;
step 3, inputting the feature map obtained after multi-layer fusion into the pedestrian region proposal network to generate pedestrian candidate regions;
step 4, mapping the pedestrian candidate regions back onto the fused feature map and pooling them into feature vectors of the same size;
step 5, inputting the pooled vectors into the classification-and-regression network for further classification and regression of the candidate regions, finally obtaining the pedestrian detection result.
The invention has the following beneficial effects:
the method fuses the convolutional feature maps produced by several convolutional layers; this strategy exploits both the rich detail of shallow features and the abstract semantics of deep features, enriching the representation of small-scale pedestrians and improving their classification. By mapping pedestrian candidate regions of different sizes back onto a shared feature map, the method avoids extracting convolutional features separately for each candidate region, reuses features, and greatly increases detection speed. Candidate-region feature maps of different sizes are converted to a common size by candidate-region pooling to facilitate classification and regression, and pixels that cannot be divided evenly during pooling are obtained by bilinear interpolation, which avoids quantization and improves the accuracy of pedestrian localization.
The method remains robust and efficient in autonomous-driving and surveillance scenes containing many small-scale pedestrians, and therefore has good practical application value.
Drawings
Fig. 1 is a flow chart of the pedestrian detection method designed by the invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The small-scale pedestrian detection method based on multi-level feature fusion specifically comprises the following steps:
step 1, inputting the training-set images into a VGG-16 convolutional neural network pre-trained on the ImageNet data set to extract features; generating pedestrian candidate regions with a pedestrian region proposal network from the extracted features; mapping the candidate regions back onto the feature map extracted by the convolutional neural network and pooling them into feature vectors of the same size; sending the feature vectors into a classification-and-regression network for classification and regression; and finally setting a loss function for the overall network to complete end-to-end training;
step 2, inputting the test-set images into the VGG-16 convolutional neural network pre-trained on the ImageNet data set (the network has 5 convolution modules) for feature extraction to obtain several convolutional feature maps, and performing feature fusion on the feature maps after the 3rd, 4th, and 5th convolution modules;
step 3, inputting the feature map obtained after multi-layer fusion into the pedestrian region proposal network to generate pedestrian candidate regions;
step 4, pooling the pedestrian-candidate-region features into feature vectors of the same size;
step 5, inputting the pooled vectors into the classification-and-regression network for further classification and regression of the candidate regions, finally obtaining the pedestrian detection result.
Setting the loss function of the overall network in step 1 to complete end-to-end training specifically comprises:
1-1, a pedestrian candidate region is assigned a positive label if it has the highest Intersection over Union (IoU) with a ground-truth (annotation) box, or if its IoU with a ground-truth box exceeds 0.6; a candidate region whose IoU with every ground-truth box is below 0.2 is assigned a negative label. IoU is calculated as:

IoU(A, B) = area(A ∩ B) / area(A ∪ B)

1-2, design the loss function L({p_i}, {t_i}) of the overall detection network:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where i is the index of a pedestrian candidate region in a mini-batch, and p_i is the predicted probability that candidate region i is foreground. The ground-truth label p_i* is 1 if the candidate region is a positive sample and 0 if it is a negative sample. t_i is a vector of the 4 parameterized coordinates of the predicted bounding box, and t_i* is the corresponding vector for the ground-truth box associated with a positive candidate region. The classification loss L_cls is the log loss over the two classes (foreground or background). The regression loss L_reg is weighted by p_i*, so it is active only for positive samples and disabled otherwise. N_cls and N_reg are normalizing constants, and λ balances the two terms.
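The IoU computation and the label assignment of step 1-1 can be sketched in NumPy as follows. This is a minimal illustration rather than the patented implementation; the [x1, y1, x2, y2] box format and the "-1 = ignored" convention for regions between the two thresholds are our assumptions, while the 0.6/0.2 thresholds and the highest-IoU rule come from the text:

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between anchors (N,4) and ground-truth boxes (M,4),
    both in [x1, y1, x2, y2] format. IoU = area(A ∩ B) / area(A ∪ B)."""
    ax1, ay1, ax2, ay2 = np.split(anchors, 4, axis=1)         # (N,1) each
    gx1, gy1, gx2, gy2 = gt_boxes.T                           # (M,) each
    inter_w = np.clip(np.minimum(ax2, gx2) - np.maximum(ax1, gx1), 0, None)
    inter_h = np.clip(np.minimum(ay2, gy2) - np.maximum(ay1, gy1), 0, None)
    inter = inter_w * inter_h                                 # (N,M)
    area_a = (ax2 - ax1) * (ay2 - ay1)                        # (N,1)
    area_g = (gx2 - gx1) * (gy2 - gy1)                        # (M,)
    return inter / (area_a + area_g - inter)                  # (N,M)

def assign_labels(anchors, gt_boxes, pos_thr=0.6, neg_thr=0.2):
    """1 = positive, 0 = negative, -1 = ignored (between the thresholds)."""
    ious = iou_matrix(anchors, gt_boxes)
    best_per_anchor = ious.max(axis=1)
    labels = np.full(len(anchors), -1)
    labels[best_per_anchor < neg_thr] = 0
    labels[best_per_anchor > pos_thr] = 1
    # the anchor with the highest IoU for each ground-truth box is also positive
    labels[ious.argmax(axis=0)] = 1
    return labels
```

A candidate identical to a ground-truth box gets IoU 1.0 and a positive label, while a far-away candidate gets IoU 0.0 and a negative label.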
The feature fusion in step 2 on the feature maps obtained after the 3rd, 4th, and 5th convolution modules proceeds as follows:
2-1, extract features from the input image by convolution with the VGG-16 convolutional neural network (which has 5 convolution modules) pre-trained on the ImageNet data set; each convolution module produces a convolutional feature map.
2-2, because each pooling operation reduces the resolution of the subsequent feature map, upsample the smaller feature map after the 5th convolution module to the size of the map after the 4th module by bilinear interpolation, and downsample the larger feature map after the 3rd module to the same size by max pooling.
2-3, normalize each resized feature map with a Batch Normalization layer before the feature channels are stacked.
2-4, fuse the convolutional feature maps after the 3rd, 4th, and 5th convolution modules by channel concatenation.
2-5, finally add a 1 × 1 convolutional layer to reduce the fused features to 512 dimensions, followed by a ReLU (Rectified Linear Unit) nonlinear activation function, which also improves the expressive power of the detection network.
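Steps 2-1 to 2-5 can be sketched in NumPy as below. The spatial sizes, the random stand-in for the learned 1 × 1 convolution weights, and the inference-style per-channel normalization (γ=1, β=0) are our assumptions; the channel counts (256 after conv3, 512 after conv4/conv5) follow VGG-16, and the bilinear upsampling, 2 × 2 max pooling, normalization, channel concatenation, and 1 × 1 convolution with ReLU mirror the fusion described above:

```python
import numpy as np

def bilinear_upsample_x2(x):
    """Upsample a (C, H, W) map to (C, 2H, 2W) by bilinear interpolation."""
    c, h, w = x.shape
    ys = np.linspace(0, h - 1, 2 * h)
    xs = np.linspace(0, w - 1, 2 * w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1); wy = ys - y0
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1); wx = xs - x0
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy)[None, :, None] + bot * wy[None, :, None]

def max_pool_x2(x):
    """Downsample a (C, H, W) map to (C, H/2, W/2) by 2x2 max pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def channel_norm(x, eps=1e-5):
    """Per-channel normalization (inference-style batch norm, gamma=1, beta=0)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def fuse(conv3, conv4, conv5, w_1x1):
    """Resize conv3/conv5 to conv4's resolution, normalize each map,
    concatenate channels, then reduce to 512 dims with 1x1 conv + ReLU."""
    parts = [channel_norm(max_pool_x2(conv3)),
             channel_norm(conv4),
             channel_norm(bilinear_upsample_x2(conv5))]
    fused = np.concatenate(parts, axis=0)            # (256+512+512, H4, W4)
    out = np.einsum('oc,chw->ohw', w_1x1, fused)     # 1x1 conv = channel mixing
    return np.maximum(out, 0.0)                      # ReLU

rng = np.random.default_rng(0)
conv3 = rng.standard_normal((256, 40, 40))   # after the 3rd conv module
conv4 = rng.standard_normal((512, 20, 20))   # after the 4th conv module
conv5 = rng.standard_normal((512, 10, 10))   # after the 5th conv module
w_1x1 = rng.standard_normal((512, 1280)) * 0.01
fused = fuse(conv3, conv4, conv5, w_1x1)
print(fused.shape)                           # (512, 20, 20)
```

In a real network the 1 × 1 convolution and batch-norm parameters are learned; the point here is only how the three maps are brought to a common resolution and merged.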
The pedestrian region proposal network of step 3 generates pedestrian candidate regions as follows:
3-1, on the convolutional feature map obtained after multi-level feature fusion, generate pedestrian reference frames with an aspect ratio of 0.41 at 11 scales, centered on every point.
3-2, then slide a 3 × 3 window over the feature map for convolution; each sliding window produces a 512-dimensional feature vector.
3-3, feed each feature vector into two 1 × 1 layers: a classification layer outputs the confidence that the reference frame is foreground, and a regression layer outputs the coordinate offsets of the reference frame relative to the annotation box.
3-4, finally output candidate regions containing only pedestrians according to the confidences and coordinate offsets.
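The reference-frame generation of step 3-1 might look like the NumPy sketch below. The text specifies only the aspect ratio 0.41 (interpreted here as width/height, the usual pedestrian convention) and the count of 11 scales; the feature stride of 8 and the geometric progression of heights are assumptions for illustration:

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=8, aspect=0.41,
                 heights=40 * 1.3 ** np.arange(11)):
    """One anchor per (position, scale): boxes [x1, y1, x2, y2] centered on
    every feature-map cell, with width = aspect * height."""
    widths = aspect * heights                                  # (11,)
    cx = (np.arange(feat_w) + 0.5) * stride                    # image-coord centers
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                               # (H, W)
    cx = cx.ravel()[:, None]                                   # (H*W, 1)
    cy = cy.ravel()[:, None]
    x1 = cx - widths / 2; x2 = cx + widths / 2                 # (H*W, 11)
    y1 = cy - heights / 2; y2 = cy + heights / 2
    return np.stack([x1, y1, x2, y2], axis=-1).reshape(-1, 4)  # (H*W*11, 4)

anchors = make_anchors(20, 20)
print(anchors.shape)   # (4400, 4) -- 20 * 20 positions * 11 scales
```

Every anchor keeps the fixed 0.41 width-to-height ratio, which suits the tall, narrow silhouette of pedestrians.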
The pedestrian-candidate-region pooling of step 4 proceeds as follows:
4-1, map each pedestrian candidate region onto the feature map obtained by multi-level feature fusion.
4-2, pool the feature maps of candidate regions of different sizes into the fixed 7 × 7 × 512 size required by the fully connected layers; points that cannot be divided evenly during pooling are obtained by bilinear interpolation.
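The pooling of steps 4-1 and 4-2 resembles RoIAlign: each output bin samples the feature map at non-integer coordinates by bilinear interpolation instead of quantizing them. A minimal NumPy sketch follows; the one-sample-per-bin-center simplification is ours, while the 7 × 7 output and 512 channels come from the text:

```python
import numpy as np

def bilinear_at(feat, y, x):
    """Bilinearly interpolate a (C, H, W) map at continuous point (y, x)."""
    c, h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[:, y0, x0] +
            (1 - wy) * wx * feat[:, y0, x1] +
            wy * (1 - wx) * feat[:, y1, x0] +
            wy * wx * feat[:, y1, x1])

def roi_pool_align(feat, roi, out=7):
    """Pool one RoI [x1, y1, x2, y2] (in feature-map coordinates) to
    (C, out, out), sampling each bin center by bilinear interpolation,
    so no coordinate is ever rounded (quantized)."""
    x1, y1, x2, y2 = roi
    bh, bw = (y2 - y1) / out, (x2 - x1) / out
    pooled = np.empty((feat.shape[0], out, out))
    for i in range(out):
        for j in range(out):
            pooled[:, i, j] = bilinear_at(feat,
                                          y1 + (i + 0.5) * bh,
                                          x1 + (j + 0.5) * bw)
    return pooled

feat = np.random.default_rng(1).standard_normal((512, 20, 20))
vec = roi_pool_align(feat, (3.2, 4.7, 11.9, 16.3))
print(vec.shape)   # (512, 7, 7)
```

Production code would typically use a library operator (e.g. `torchvision.ops.roi_align`) that additionally averages several sample points per bin.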
The candidate-region classification and regression of step 5 proceeds as follows:
5-1, feed the equally sized pooled candidate-region feature maps into two fully connected layers: one performs classification to reduce false pedestrian detections, and the other regresses the position to localize pedestrians more accurately.
5-2, apply non-maximum suppression to the generated pedestrian detection boxes to eliminate duplicate boxes covering the same pedestrian, finally obtaining the pedestrian detection result.
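The non-maximum suppression of step 5-2 can be sketched in pure NumPy as greedy NMS; the 0.5 overlap threshold below is a typical default, not a value specified in the text:

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the highest-scoring box, drop all boxes overlapping
    it by more than iou_thr, and repeat. Returns indices of kept boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # indices, best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        iw = np.clip(np.minimum(x2[i], x2[order[1:]]) -
                     np.maximum(x1[i], x1[order[1:]]), 0, None)
        ih = np.clip(np.minimum(y2[i], y2[order[1:]]) -
                     np.maximum(y1[i], y1[order[1:]]), 0, None)
        inter = iw * ih
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thr]   # survivors for the next round
    return keep

boxes = np.array([[0, 0, 10, 20], [1, 1, 11, 21], [50, 50, 60, 70]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2] -- box 1 duplicates box 0 and is removed
```

The two heavily overlapping boxes collapse to the higher-scoring one, so each pedestrian is reported once.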
Claims (6)
1. A small-scale pedestrian detection method based on multi-level feature fusion is characterized by comprising the following steps:
step 1, inputting images in a training set sample into a VGG-16 convolutional neural network trained in advance on an ImageNet data set to extract features, then generating a pedestrian candidate region according to the extracted features by a pedestrian region proposing network, mapping the pedestrian candidate region back to a feature map extracted by the convolutional neural network, pooling the feature map into feature vectors with the same size, sending the feature vectors into a classification regression network to perform classification regression, and finally setting a loss function of an integral network to finish end-to-end training;
step 2, inputting the images in the test set into a convolutional neural network for feature extraction to obtain a multilayer convolutional feature map, and performing multilayer feature fusion on the shallow features with rich detail information and the deep features with rich semantic information;
step 3, inputting a feature map obtained after the fusion of the multilayer features into a pedestrian area proposing network to generate a pedestrian candidate area;
step 4, pooling the pedestrian candidate region features into feature vectors with the same size;
and 5, inputting the pooled feature vectors into a classification regression network to perform further classification regression on the pedestrian candidate region, and finally obtaining the detection result of the pedestrian.
2. The small-scale pedestrian detection method based on multi-level feature fusion according to claim 1, wherein setting the loss function of the overall network in step 1 to complete end-to-end training comprises:
1-1, assigning a positive label to a pedestrian candidate region that has the highest Intersection over Union (IoU) with a ground-truth box or whose IoU with a ground-truth box exceeds 0.6, and assigning a negative label to a candidate region whose IoU is below 0.2, where IoU is calculated as:

IoU(A, B) = area(A ∩ B) / area(A ∪ B)

1-2, designing the loss function L({p_i}, {t_i}) of the overall detection network:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where i is the index of a pedestrian candidate region in a mini-batch; p_i is the predicted probability that candidate region i is foreground; the ground-truth label p_i* is 1 if the candidate region is a positive sample and 0 if it is a negative sample; t_i is a vector of the 4 parameterized coordinates of the predicted bounding box, and t_i* is the corresponding vector for the ground-truth box associated with a positive candidate region; the classification loss L_cls is the log loss over the two classes (foreground or background); the regression loss L_reg is weighted by p_i*, so that it is active only for positive samples and disabled otherwise; N_cls and N_reg are normalizing constants, and λ balances the two terms.
3. The multi-level feature fusion small-scale pedestrian detection method according to claim 1 or 2, wherein the multi-level feature fusion in step 2 is implemented as follows:
2-1, performing convolution on an input image by using a convolution neural network which is provided with 5 convolution modules and is trained on an ImageNet data set in advance to extract features, wherein a convolution feature map is generated after each convolution module;
2-2, because the resolution of the feature map after each layer is reduced due to the Pooling operation, up-sampling the 5 th layer convolved feature map with smaller size to the size of the 4 th layer convolved feature map in a bilinear interpolation mode, and down-sampling the 3 rd layer convolved feature map with larger size to the size of the 4 th layer convolved feature map in a Max Pooling mode;
2-3, respectively normalizing the feature maps of the different layers with adjusted sizes through a Batch Normalization layer before the feature channels are superposed;
2-4, performing channel fusion on the convolution characteristic graphs after the 3 rd, 4 th and 5 th convolution modules in a channel superposition mode;
2-5, adding a 1 × 1 convolutional layer to reduce the fused features to 512 dimensions, adding a ReLU (Rectified Linear Unit) nonlinear activation function after the convolutional layer, and finally inputting the feature map obtained by fusing the multi-layer features into the pedestrian region proposal network to generate pedestrian candidate regions.
4. The method according to claim 3, wherein the pedestrian region proposal network in step 3 generates pedestrian candidate regions, specifically as follows:
3-1, generating a pedestrian reference frame with the aspect ratio of 0.4 and 12 scales by taking each point as the center on a convolution feature map obtained after multi-level feature fusion;
3-2, sliding a 3 x 3 window on the convolution feature map for convolution operation, and generating a 512-dimensional feature vector by each sliding window;
3-3, respectively inputting the generated feature vectors into two 1 x 1 full-connection layers, wherein one classification layer outputs a reference frame as the confidence coefficient of the foreground, and the other regression layer outputs the coordinate offset of the reference frame compared with the marking frame;
and 3-4, finally outputting the candidate region only containing the pedestrian according to the confidence coefficient and the coordinate offset.
5. The method for detecting pedestrians with multi-level feature fusion according to claim 4, wherein the pedestrian candidate area pooling in step 4 is as follows:
4-1, mapping the pedestrian candidate region to a feature map obtained by multi-level feature fusion;
4-2, pooling feature maps corresponding to pedestrian candidate areas with different sizes into a fixed size of 7 x 512 required by a full connection layer, and supplementing points which cannot be evenly divided in the pooling process in a bilinear interpolation mode.
6. The method for detecting the small-scale pedestrian by fusing the multi-level features according to claim 5, wherein the step 5 is as follows:
5-1, respectively sending the characteristic graphs of the pedestrian candidate regions with the same size after pooling into two full-connection layers, wherein one full-connection layer is used for classification so as to reduce false detection of pedestrians; one for returning the position to make the pedestrian's location more accurate;
and 5-2, carrying out non-maximum value suppression processing on the finally generated pedestrian detection frame to eliminate repeated detection frames containing the same pedestrian, and finally obtaining the detection result of the pedestrian.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910899189.5A CN110781744A (en) | 2019-09-23 | 2019-09-23 | Small-scale pedestrian detection method based on multi-level feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910899189.5A CN110781744A (en) | 2019-09-23 | 2019-09-23 | Small-scale pedestrian detection method based on multi-level feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110781744A true CN110781744A (en) | 2020-02-11 |
Family
ID=69383763
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910899189.5A Pending CN110781744A (en) | 2019-09-23 | 2019-09-23 | Small-scale pedestrian detection method based on multi-level feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110781744A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108960074A (en) * | 2018-06-07 | 2018-12-07 | 西安电子科技大学 | Small size pedestrian target detection method based on deep learning |
CN109522966A (en) * | 2018-11-28 | 2019-03-26 | 中山大学 | A kind of object detection method based on intensive connection convolutional neural networks |
CN109886066A (en) * | 2018-12-17 | 2019-06-14 | 南京理工大学 | Fast target detection method based on the fusion of multiple dimensioned and multilayer feature |
CN110046572A (en) * | 2019-04-15 | 2019-07-23 | 重庆邮电大学 | A kind of identification of landmark object and detection method based on deep learning |
CN110084292A (en) * | 2019-04-18 | 2019-08-02 | 江南大学 | Object detection method based on DenseNet and multi-scale feature fusion |
CN110163187A (en) * | 2019-06-02 | 2019-08-23 | 东北石油大学 | Remote road traffic sign detection recognition methods based on F-RCNN |
Non-Patent Citations (1)
Title |
---|
TAO KONG et al.: "HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection", arXiv:1604.00600v1 *
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111709291A (en) * | 2020-05-18 | 2020-09-25 | 杭州电子科技大学 | Takeaway personnel identity identification method based on fusion information |
CN111709291B (en) * | 2020-05-18 | 2023-05-26 | 杭州电子科技大学 | Takeaway personnel identity recognition method based on fusion information |
CN111814754A (en) * | 2020-08-18 | 2020-10-23 | 深延科技(北京)有限公司 | Single-frame image pedestrian detection method and device for night scene |
CN111813532A (en) * | 2020-09-04 | 2020-10-23 | 腾讯科技(深圳)有限公司 | Image management method and device based on multitask machine learning model |
CN112101455B (en) * | 2020-09-15 | 2022-08-09 | 重庆市农业科学院 | Tea lesser leafhopper identification and counting method based on convolutional neural network |
CN112101455A (en) * | 2020-09-15 | 2020-12-18 | 重庆市农业科学院 | Tea lesser leafhopper identification and counting method based on convolutional neural network |
CN112614091A (en) * | 2020-12-10 | 2021-04-06 | 清华大学 | Ultrasonic multi-section data detection method for congenital heart disease |
CN112836708A (en) * | 2021-01-25 | 2021-05-25 | 绍兴图信物联科技有限公司 | Image feature detection method based on Gram matrix and F norm |
CN112836708B (en) * | 2021-01-25 | 2022-05-13 | 绍兴图信物联科技有限公司 | Image feature detection method based on Gram matrix and F norm |
CN112800942A (en) * | 2021-01-26 | 2021-05-14 | 泉州装备制造研究所 | Pedestrian detection method based on self-calibration convolutional network |
CN112800942B (en) * | 2021-01-26 | 2024-02-13 | 泉州装备制造研究所 | Pedestrian detection method based on self-calibration convolutional network |
CN113920468A (en) * | 2021-12-13 | 2022-01-11 | 松立控股集团股份有限公司 | Multi-branch pedestrian detection method based on cross-scale feature enhancement |
CN115909418A (en) * | 2023-03-01 | 2023-04-04 | 科大讯飞股份有限公司 | Human body direction determining method, human body direction determining device, screen control method, device and related equipment |
CN117475389A (en) * | 2023-12-27 | 2024-01-30 | 山东海润数聚科技有限公司 | Pedestrian crossing signal lamp control method, system, equipment and storage medium |
CN117475389B (en) * | 2023-12-27 | 2024-03-15 | 山东海润数聚科技有限公司 | Pedestrian crossing signal lamp control method, system, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110781744A (en) | Small-scale pedestrian detection method based on multi-level feature fusion | |
CN109584248B (en) | Infrared target instance segmentation method based on feature fusion and dense connection network | |
CN112132156B (en) | Image saliency target detection method and system based on multi-depth feature fusion | |
CN110782420A (en) | Small target feature representation enhancement method based on deep learning | |
CN110942471B (en) | Long-term target tracking method based on space-time constraint | |
CN111652081B (en) | Video semantic segmentation method based on optical flow feature fusion | |
CN112541491B (en) | End-to-end text detection and recognition method based on image character region perception | |
Li et al. | ComNet: Combinational neural network for object detection in UAV-borne thermal images | |
CN114202743A (en) | Improved fast-RCNN-based small target detection method in automatic driving scene | |
CN115631344B (en) | Target detection method based on feature self-adaptive aggregation | |
US20240161304A1 (en) | Systems and methods for processing images | |
Muthalagu et al. | Vehicle lane markings segmentation and keypoint determination using deep convolutional neural networks | |
CN112149526A (en) | Lane line detection method and system based on long-distance information fusion | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN112464775A (en) | Video target re-identification method based on multi-branch network | |
Fan et al. | A novel sonar target detection and classification algorithm | |
Li et al. | Detection of road objects based on camera sensors for autonomous driving in various traffic situations | |
Cho et al. | Modified perceptual cycle generative adversarial network-based image enhancement for improving accuracy of low light image segmentation | |
CN116758340A (en) | Small target detection method based on super-resolution feature pyramid and attention mechanism | |
CN114743045B (en) | Small sample target detection method based on double-branch area suggestion network | |
CN111626298B (en) | Real-time image semantic segmentation device and segmentation method | |
CN114494827A (en) | Small target detection method for detecting aerial picture | |
CN117036658A (en) | Image processing method and related equipment | |
CN111652245B (en) | Vehicle contour detection method, device, computer equipment and storage medium | |
Zhang et al. | Depth Monocular Estimation with Attention-based Encoder-Decoder Network from Single Image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20200211 |