CN113033371A - CSP model-based multi-level feature fusion pedestrian detection method - Google Patents
- Publication number
- CN113033371A (Application CN202110295911.1A)
- Authority
- CN
- China
- Prior art keywords
- target
- stage
- network
- size
- center point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
A CSP model-based multi-level feature fusion pedestrian detection method adopts the CSP architecture: a CNN extracts pedestrian features, and the network then splits into 3 branches that respectively predict the target center point, the target height and the center-point offset. After image preprocessing, PyConvResNet-101 is used as the feature extraction network to extract feature maps from input images; the feature maps obtained at different stages undergo multi-level fusion to obtain a final feature map, which is sent to the prediction network. The prediction network is trained with Focal Loss and Smooth L1; target detection boxes are generated from the target center point, target height and center-point offset in the prediction maps, and redundant boxes are removed with a non-maximum suppression algorithm to obtain the final detection result. The invention can fully fuse the rich semantic information of high-level feature maps with the rich positional information of low-level feature maps, and effectively reduces false detections and missed detections for small targets and severe occlusion.
Description
Technical Field
The invention relates to the fields of computer vision and target detection, and in particular to a video-oriented pedestrian detection method.
Background
Computer vision has long been a research hotspot and challenge in computer science, and pedestrian detection, as a subtask of target detection, has become a very important research problem in the field. Convolutional Neural Networks (CNNs) have shown great power in computer vision and object detection in recent years, and the development of many CNN-based general target detection methods has advanced research on and applications of pedestrian detection. Nevertheless, pedestrian detection still has considerable room for improvement. The main problem is that the feature information of small targets and severely occluded targets is difficult to extract, resulting in missed detections and false detections. CSP (Center and Scale Prediction) is a pedestrian detection algorithm proposed in 2019 that learns pedestrian features through a CNN and predicts the center-point coordinates and size information of pedestrian targets to complete the pedestrian detection task.
Disclosure of Invention
Aiming at the false and missed detections caused by small targets and severe occlusion in pedestrian detection, the invention provides a CSP model-based multi-level feature fusion pedestrian detection method, which can fully fuse the rich semantic information of high-level feature maps with the rich positional information of low-level feature maps, effectively reducing false detections and missed detections for small targets and severe occlusion.
A multi-level feature fusion pedestrian detection method based on a CSP model comprises the following steps:
step 1, adopting a CSP architecture in the training stage: a CNN extracts pedestrian features and the network then splits into 3 branches that respectively predict the target center point, the target height and the center-point offset; training images are preprocessed (resized to set pixels, randomly cropped and brightness-adjusted), PyConvResNet-101 is used as the feature extraction network, the 4 feature maps from its stages two to five are fused at multiple levels into a final feature map with 1024 channels, random erasing is used for data augmentation, and the three branches are trained with Focal Loss and Smooth L1;

step 2, sending the obtained final feature map to a subsequent prediction network, which first adjusts the channel count of the final feature map to 2^n (n a positive integer, 3 ≤ n ≤ 9) with a 3 × 3 convolution, then predicts the target center point, target height and center-point offset with two 1 × 1 convolutions and one 2 × 2 convolution to generate target detection boxes, and removes redundant boxes with a non-maximum suppression algorithm to obtain the final detection result;
and step 3, in the testing stage, resizing the test image to a specific size and inputting it into the network; the obtained feature maps are fused at multiple levels and sent to the prediction network, which outputs the target center point, the target height and the target center-point offset, and the target width is obtained by multiplying the target height by a coefficient.
Further, the process of step 1 is as follows: the last feature maps p2, p3, p4 and p5 of stages two, three, four and five of the PyConvResNet-101 network are fused at multiple levels, where p2, p3, p4 and p5 are downsampled 4, 8, 16 and 32 times with respect to the width and height of the input image. The multi-level fusion comprises the following steps:

1.1) A deconvolution with kernel size 4 × 4, stride 2 and padding 1 upsamples p5 by 2×, and the result is concatenated with p4 in the channel direction to obtain p4_l1; likewise, p4 is upsampled 2× and concatenated with p3 to obtain p3_l1, and p3 is upsampled 2× and concatenated with p2 to obtain p2_l1;

1.2) The same 4 × 4, stride-2, padding-1 deconvolution upsamples the feature map p4_l1 obtained in 1.1) by 2×, and the result is concatenated with p3_l1 in the channel direction to obtain p3_l2; likewise, p3_l1 is upsampled 2× and concatenated with p2_l1 to obtain p2_l2;

1.3) The same deconvolution upsamples the feature map p3_l2 obtained in 1.2) by 2×, and the result is concatenated with p2_l2 in the channel direction to obtain the final feature map p_out, which is sent to the subsequent prediction network.
Preferably, in step 3, the coefficient is 0.41.
The invention has the beneficial effects that: it adopts the CSP model architecture with PyConvResNet-101 as the feature extraction network and performs multi-level fusion on the 4 feature maps output by that network, so it can fully fuse the rich semantic information of high-level feature maps with the rich positional information of low-level feature maps, effectively reducing false detections and missed detections for small targets and severe occlusion.
Drawings
Fig. 1 is a flow chart of a multilevel feature fusion pedestrian detection method based on a CSP model according to the present invention.
FIG. 2 is a CSP model architecture diagram.
FIG. 3 is a schematic diagram of pyramid convolution.
Fig. 4 is a structure diagram of a multilevel feature fusion pedestrian detection method based on a CSP model.
Fig. 5 compares the effect of the CSP model-based multi-level feature fusion pedestrian detection method with other pedestrian detection techniques on the Caltech dataset, where (a) is the Reasonable subset, (b) the Heavy subset, (c) the Medium subset, (d) the Near subset and (e) the All subset.
Detailed Description
The invention is further illustrated by the following figures and examples.
Referring to fig. 1 to 5, a multilevel feature fusion pedestrian detection method based on a CSP model includes the following steps:
Each convolution kernel of PyConvResNet-101 comprises a multi-layered pyramid structure, each layer containing a different type of convolution kernel (see fig. 3). Pyramid convolution can process an input image with convolution kernels of multiple scales without increasing the computational burden or model complexity. Kernels at different pyramid levels have different sizes and channel counts: kernel size grows from the bottom layer to the top layer, while the number of kernel channels shrinks. To match the channel counts of the kernels at different pyramid layers, grouped convolution is applied to the input feature map.
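The cost-balancing idea above can be illustrated with simple parameter arithmetic: larger kernels receive fewer output channels and more input groups, so the total stays close to a plain 3 × 3 convolution. The level configuration below (kernels 3/5/7/9, groups 1/4/8/16, equal output-channel split) is an assumption for illustration, not the exact PyConvResNet-101 configuration.

```python
# Parameter count of a (grouped) convolution: k^2 * (C_in / groups) * C_out.
def conv_params(k, c_in, c_out, groups=1):
    return k * k * (c_in // groups) * c_out

c_in, c_out = 64, 64
standard = conv_params(3, c_in, c_out)          # plain 3x3 convolution
levels = [(3, 1), (5, 4), (7, 8), (9, 16)]      # (kernel size, groups) per level
pyramid = sum(conv_params(k, c_in, c_out // len(levels), g) for k, g in levels)
# Despite kernels up to 9x9, grouping keeps the pyramid no costlier
# than the single 3x3 baseline.
assert pyramid < standard
```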
The specific steps of random erasing are as follows: for an image I in a batch, the probability of applying random erasing is set to 0.5. For an image of width W and height H, the image area is S = W × H. An erase area S_e is randomly initialized, and the aspect ratio of the erase region is set to r_e ∈ [0.3, 3.3]. The erase region then has height H_e = sqrt(S_e × r_e) and width W_e = sqrt(S_e / r_e). A point a = (x_e, y_e) is randomly initialized on the image I; if x_e + W_e ≤ W and y_e + H_e ≤ H, the region I_e = (x_e, y_e, x_e + W_e, y_e + H_e) is taken as the randomly erased region, otherwise the above steps are repeated until a qualifying I_e occurs. Each pixel inside I_e is assigned a random value in [0, 255].
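The random-erasing procedure above can be sketched as follows. The erase-area ratio range [0.02, 0.4] is an assumption (the values from the original Random Erasing paper); the text here only fixes the probability 0.5 and r_e ∈ [0.3, 3.3].

```python
import numpy as np

def random_erase(img, p=0.5, area_range=(0.02, 0.4),
                 aspect_range=(0.3, 3.3), rng=None):
    """Erase a random rectangle of `img` (H x W x C, uint8) with noise."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() > p:                      # apply with probability p
        return img
    h, w = img.shape[:2]
    area = h * w                              # S = W * H
    for _ in range(100):                      # retry until the region fits
        s_e = rng.uniform(*area_range) * area # erase area S_e
        r_e = rng.uniform(*aspect_range)      # aspect ratio r_e = H_e / W_e
        h_e = int(round(np.sqrt(s_e * r_e)))  # H_e = sqrt(S_e * r_e)
        w_e = int(round(np.sqrt(s_e / r_e)))  # W_e = sqrt(S_e / r_e)
        x_e = int(rng.integers(0, w))
        y_e = int(rng.integers(0, h))
        if x_e + w_e <= w and y_e + h_e <= h: # region I_e lies inside the image
            img = img.copy()
            img[y_e:y_e + h_e, x_e:x_e + w_e] = rng.integers(
                0, 256, size=(h_e, w_e) + img.shape[2:], dtype=img.dtype)
            return img
    return img                                # no fitting region found
```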
Referring to fig. 4, the last feature maps p2, p3, p4 and p5 of stages two, three, four and five of the PyConvResNet-101 network are fused at multiple levels, where p2, p3, p4 and p5 are downsampled 4, 8, 16 and 32 times with respect to the width and height of the input image. The fusion proceeds as follows:

1.1) A deconvolution with kernel size 4 × 4, stride 2 and padding 1 upsamples p5 by 2×, and the result is concatenated with p4 in the channel direction to obtain p4_l1; likewise, p4 is upsampled 2× and concatenated with p3 to obtain p3_l1, and p3 is upsampled 2× and concatenated with p2 to obtain p2_l1.

1.2) The same 4 × 4, stride-2, padding-1 deconvolution upsamples the feature map p4_l1 obtained in step 1.1 by 2×, and the result is concatenated with p3_l1 in the channel direction to obtain p3_l2; likewise, p3_l1 is upsampled 2× and concatenated with p2_l1 to obtain p2_l2.

1.3) The same deconvolution upsamples the feature map p3_l2 obtained in step 1.2 by 2×, and the result is concatenated with p2_l2 in the channel direction to obtain the final feature map p_out, which is sent to the subsequent prediction network.
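The choice of a 4 × 4 kernel with stride 2 and padding 1 is what makes each deconvolution exactly double a feature map, so each fusion step aligns a map with the next shallower one. The arithmetic can be checked directly (the 640 × 1280 input size below is an illustrative assumption):

```python
# Transposed-convolution output size: out = (in - 1) * stride - 2 * padding + kernel.
def deconv_out(size, kernel=4, stride=2, padding=1):
    return (size - 1) * stride - 2 * padding + kernel

h, w = 640, 1280   # hypothetical input image size
# p2..p5 are the input downsampled 4x, 8x, 16x and 32x:
sizes = {f"p{i}": (h // s, w // s) for i, s in zip(range(2, 6), (4, 8, 16, 32))}
# Upsampled p5 matches p4, upsampled p4 matches p3, upsampled p3 matches p2,
# so channel-wise concatenation is well-defined at every fusion step.
for hi, lo in (("p5", "p4"), ("p4", "p3"), ("p3", "p2")):
    up = tuple(deconv_out(d) for d in sizes[hi])
    assert up == sizes[lo], (hi, lo, up, sizes[lo])
```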
Focal Loss and Smooth L1 are used as loss functions. Predicting the target center point is a binary classification problem: each position of the feature map is judged to contain a target center (positive sample) or not (negative sample). Because the negative samples immediately around a positive sample are very close to the center point and easily disturb training, a two-dimensional Gaussian mask is placed on each positive sample point during training:

M_ij = max_k exp( -( (i - x_k)^2 / (2 σ_{w,k}^2) + (j - y_k)^2 / (2 σ_{h,k}^2) ) )

where K is the number of targets in the image and (x_k, y_k, w_k, h_k) are the center point, width and height of the k-th target. The Gaussian mask variances σ_{w,k} and σ_{h,k} are proportional to the width and height of the individual target, respectively. Where the Gaussian masks of two targets overlap, the larger of the two values is taken.
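The Gaussian-mask construction can be sketched as follows. The proportionality constant between the standard deviations and the target dimensions (dim / 6 here) is an assumption; the text only states that the variances are proportional to width and height.

```python
import numpy as np

def gaussian_mask(shape, targets):
    """shape: (H, W); targets: list of (cx, cy, w, h) in mask coordinates."""
    H, W = shape
    mask = np.zeros((H, W), dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W]
    for cx, cy, w, h in targets:
        sw, sh = w / 6.0, h / 6.0    # assumed sigma-to-size proportionality
        g = np.exp(-((xs - cx) ** 2 / (2 * sw ** 2) +
                     (ys - cy) ** 2 / (2 * sh ** 2)))
        mask = np.maximum(mask, g)   # overlapping masks keep the larger value
    return mask
```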
For the predicted target center point, Focal Loss is adopted:

L_center = -(1/K) Σ_{i,j} [ y_ij (1 - p_ij)^γ log(p_ij) + (1 - y_ij)(1 - M_ij)^β (p_ij)^γ log(1 - p_ij) ]

where p_ij ∈ [0, 1] is the network's estimate of the probability that a target center exists at location (i, j), y_ij ∈ {0, 1} is the ground-truth label (y_ij = 1 marks a positive sample), M_ij is the Gaussian mask, and β and γ are hyper-parameters set to 4 and 2, respectively.
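A sketch of this center-point Focal Loss with β = 4 and γ = 2, using the Gaussian mask M to down-weight negatives near a center. Normalizing by the number of positives K follows the CSP formulation and is an assumption here.

```python
import numpy as np

def center_focal_loss(p, y, M, beta=4.0, gamma=2.0, eps=1e-6):
    """p: predicted probabilities; y: 0/1 labels; M: Gaussian mask (all H x W)."""
    p = np.clip(p, eps, 1 - eps)               # numerical stability
    pos = (y == 1)
    pos_term = ((1 - p) ** gamma) * np.log(p)  # well-classified positives vanish
    neg_term = ((1 - M) ** beta) * (p ** gamma) * np.log(1 - p)
    loss = -np.where(pos, pos_term, neg_term).sum()
    return loss / max(pos.sum(), 1)            # normalize by positive count K
```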
Predicting the target height and center-point offset is a regression problem, for which Smooth L1 is used:

L_reg = Σ_k SmoothL1(s_k - t_k), with SmoothL1(x) = 0.5 x^2 if |x| < 1 and |x| - 0.5 otherwise,

where s_k and t_k are the network's predicted value and the ground-truth value for each positive sample.
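The Smooth L1 regression loss used by the height and offset branches is a standard piecewise function and can be written directly:

```python
import numpy as np

def smooth_l1(s, t):
    """Element-wise Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = np.abs(np.asarray(s, dtype=float) - np.asarray(t, dtype=float))
    return np.where(x < 1.0, 0.5 * x * x, x - 0.5)

def regression_loss(preds, targets):
    # Sum over the positive samples, as in the branch loss above.
    return smooth_l1(preds, targets).sum()
```

The quadratic zone keeps gradients small for near-correct predictions, while the linear zone bounds the influence of outliers.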
The total loss function is a weighted sum of the three branch loss functions:

L = λ_c L_center + λ_s L_scale + λ_o L_offset

where λ_c, λ_s and λ_o, the weight coefficients of the target center-point classification loss, the scale regression loss and the offset regression loss, are set to 0.01, 1 and 0.1, respectively.
Step 2: The obtained final feature map is sent to the subsequent prediction network. The prediction network first uses a convolution layer with a 3 × 3 kernel, stride 1 and padding 1 to adjust the channel count of the input feature map to 2^n, where n is a positive integer with 3 ≤ n ≤ 9 (n = 8 in this embodiment), and then uses kernels of 1 × 1, 1 × 1 and 2 × 2 to predict the target center point, the target height and the target center-point offset, respectively.
Step 3: In the testing stage, the test image is resized to a specific size and input into the network; the obtained feature maps are fused at multiple levels and sent to the prediction network, which outputs the target center point, the target height and the target center-point offset. The target width is obtained by multiplying the target height by a coefficient of 0.41. Pedestrian prediction boxes are parsed from these outputs, and redundant boxes are finally removed with a non-maximum suppression algorithm to obtain the final pedestrian detection boxes.
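The test-time decoding and suppression steps can be sketched as follows: centers above a score threshold become boxes whose width is 0.41 × height, followed by greedy IoU-based non-maximum suppression. The stride, score threshold and IoU threshold values are illustrative assumptions.

```python
import numpy as np

def decode_boxes(center_map, height_map, offsets, stride=4, thresh=0.5):
    """center_map: (H, W) scores; height_map: (H, W) heights in pixels;
    offsets: (H, W, 2) center-point (dx, dy) refinements."""
    boxes, scores = [], []
    ys, xs = np.where(center_map >= thresh)
    for y, x in zip(ys, xs):
        h = height_map[y, x]
        w = 0.41 * h                               # width from height, per the patent
        cx = (x + offsets[y, x, 0]) * stride       # map feature-grid position
        cy = (y + offsets[y, x, 1]) * stride       # back to image pixels
        boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
        scores.append(center_map[y, x])
    return np.array(boxes), np.array(scores)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns kept box indices."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]       # drop overlapping lower scores
    return keep
```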
The CSP model-based multi-level feature fusion pedestrian detection method was trained on the CityPersons and Caltech training sets respectively and tested on the CityPersons validation set and the Caltech test set; the evaluation metric is the average logarithmic miss rate. As shown in table 1, table 2 and fig. 5, compared with the CSP algorithm, the method of the invention improves by 0.8%, 3.1%, 1.0%, 0.1%, 1.8% and 1.0% on subsets of the CityPersons validation set, including the Reasonable, Heavy, Partial, Bare and Large subsets, and by 0.4%, 10.5% and 4.8% on the Reasonable, Heavy and All subsets of the Caltech test set. It also compares favorably with existing pedestrian detection techniques. The experimental results show that the method effectively improves the detection performance of the CSP algorithm on small targets and severely occluded targets.
Table 1 shows the average logarithmic miss rate of each subset on the CityPersons validation set.

Table 1

Table 2 shows the average logarithmic miss rate of each subset on the Caltech test set.

Table 2
Claims (3)
1. A CSP model-based multi-level feature fusion pedestrian detection method is characterized by comprising the following steps:
step 1, adopting a CSP (Center and Scale Prediction) architecture, extracting pedestrian features by using a CNN, then dividing the network into 3 branches to respectively predict a target center point, a target height and a center-point offset; in the training stage, preprocessing a training image and then inputting it into the network, wherein the preprocessing comprises resizing the image to set pixels, randomly cropping the image and adjusting the brightness; extracting pedestrian features by using PyConvResNet-101 as the feature extraction network, and performing multi-level fusion on the 4 feature maps obtained in stages two, three, four and five of the PyConvResNet-101 network to obtain a final feature map, the number of channels of the final feature map being 1024; expanding the training data with random-erasing data augmentation; and training the target center point, target height and center-point offset branches with Focal Loss and Smooth L1;
step 2, sending the obtained final feature map into a subsequent prediction network, wherein the prediction network first adjusts the number of channels of the final feature map to 2^n with a 3 × 3 convolution, n being a positive integer with 3 ≤ n ≤ 9, then predicts the target center point, the target height and the center-point offset with two 1 × 1 convolutions and one 2 × 2 convolution to generate target detection boxes, and removes redundant boxes with a non-maximum suppression algorithm to obtain the final detection result;
and step 3, in the testing stage, resizing the test image to a specific size and inputting it into the network, performing multi-level fusion on the obtained feature maps and sending the result into the prediction network, wherein the prediction network outputs the target center point, the target height and the target center-point offset, and the target width is obtained by multiplying the target height by a coefficient.
2. The CSP model-based multi-level feature fusion pedestrian detection method according to claim 1, characterized in that the process of step 1 is as follows: the last feature maps p2, p3, p4 and p5 of stages two, three, four and five of the PyConvResNet-101 network are fused at multiple levels, where p2, p3, p4 and p5 are downsampled 4, 8, 16 and 32 times with respect to the width and height of the input image, and the multi-level fusion comprises the following steps:

1.1) a deconvolution with kernel size 4 × 4, stride 2 and padding 1 upsamples p5 by 2×, and the result is concatenated with p4 in the channel direction to obtain p4_l1; likewise, p4 is upsampled 2× and concatenated with p3 to obtain p3_l1, and p3 is upsampled 2× and concatenated with p2 to obtain p2_l1;

1.2) the same 4 × 4, stride-2, padding-1 deconvolution upsamples the feature map p4_l1 obtained in 1.1) by 2×, and the result is concatenated with p3_l1 in the channel direction to obtain p3_l2; likewise, p3_l1 is upsampled 2× and concatenated with p2_l1 to obtain p2_l2;

1.3) the same deconvolution upsamples the feature map p3_l2 obtained in 1.2) by 2×, and the result is concatenated with p2_l2 in the channel direction to obtain the final feature map p_out, which is sent into the subsequent prediction network.
3. The CSP model-based multi-level feature fusion pedestrian detection method according to claim 1, characterized in that: in step 3, the coefficient is 0.41.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110295911.1A CN113033371A (en) | 2021-03-19 | 2021-03-19 | CSP model-based multi-level feature fusion pedestrian detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110295911.1A CN113033371A (en) | 2021-03-19 | 2021-03-19 | CSP model-based multi-level feature fusion pedestrian detection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113033371A true CN113033371A (en) | 2021-06-25 |
Family
ID=76471689
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110295911.1A Pending CN113033371A (en) | 2021-03-19 | 2021-03-19 | CSP model-based multi-level feature fusion pedestrian detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113033371A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113723322A (en) * | 2021-09-02 | 2021-11-30 | 南京理工大学 | Pedestrian detection method and system based on single-stage anchor-free frame |
WO2023001059A1 (en) * | 2021-07-19 | 2023-01-26 | 中国第一汽车股份有限公司 | Detection method and apparatus, electronic device and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |