CN113269038B - Multi-scale-based pedestrian detection method - Google Patents


Info

Publication number
CN113269038B
CN113269038B (application CN202110419108.4A)
Authority
CN
China
Prior art keywords
pedestrian
data set
pedestrian detection
network
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110419108.4A
Other languages
Chinese (zh)
Other versions
CN113269038A (en)
Inventor
任健
邵文泽
李海波
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202110419108.4A priority Critical patent/CN113269038B/en
Publication of CN113269038A publication Critical patent/CN113269038A/en
Application granted
Publication of CN113269038B publication Critical patent/CN113269038B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a multi-scale-based pedestrian detection method. First, multi-scale feature fusion is used to learn more features at different scales, so that shallow features distinguish simple targets while deep features distinguish complex targets. Second, to further improve the network's detection of multi-scale and especially small targets, sliding windows of different sizes are used, so that the RPN outputs candidate regions generated by sliding windows with different receptive fields. The method improves pedestrian detection accuracy, is more robust than the prior art, and can be used to detect small-target pedestrians.

Description

Multi-scale-based pedestrian detection method
Technical Field
The invention relates to the field of computer vision image processing, in particular to a pedestrian detection method based on multiple scales.
Background
In recent years, computer vision has developed rapidly with the support of deep learning and has attracted a great number of researchers. Although their focuses differ, the ultimate goal is shared by thousands of researchers: to make the technology serve people, whether by liberating productivity or by improving quality of life. Since technology serves people, research related to "people" is essential, and such research plays a leading role in both academia and industry.
Pedestrian detection has received widespread attention over the past decade as the first and most basic step of many real-world tasks such as human behavior analysis, gait recognition, intelligent video surveillance and autonomous driving. Although deep convolutional neural networks (CNNs) have made great progress in detecting general targets and have achieved good results, pedestrian detection, an important branch of general target detection, has long remained difficult to solve. In terms of importance, pedestrian detection is a prerequisite for tasks such as pedestrian tracking, automatic driving and security monitoring. Although "person" is only a single category, many challenges remain, such as the diversity of detection scenes, the complexity of pedestrian poses, and the possibility that the target to be detected is occluded.
Pedestrian detection plays an important role in intelligent monitoring and security: to protect people and property, most public places are equipped with surveillance equipment. However, when the large amount of pedestrian data captured by such equipment is reviewed only by human operators, two problems arise. On the one hand, after monitoring for long periods a person inevitably tires and, compared with a computer, misreads or misses information; on the other hand, a person's limited capacity to process information means the monitored data is not fully exploited. Pedestrian detection technology can well compensate for these shortcomings of manual review, saving manpower and enabling timely warnings in emergencies.
Pedestrian detection is also an important problem to be overcome and improved in the field of unmanned driving. Since the emergence of driverless technology, pedestrian detection has troubled many researchers as a problem urgently needing solution and improvement. Although pedestrian detection has been in a stage of rapid development since 2005, many problems remain to be solved, mainly the still-unresolved trade-off between two aspects: speed and accuracy. In recent years, as companies such as Google actively research and develop automatic-driving technology, an effective and fast pedestrian detection method is urgently needed to ensure that pedestrian safety is not threatened during automatic driving. Solving the pedestrian detection problem can thus fundamentally improve existing unmanned-driving technology.
Pedestrian Detection (Pedestrian Detection) is the use of computer vision techniques to determine whether a Pedestrian is present in an image or video sequence and to provide accurate positioning. The technology can be combined with technologies such as pedestrian tracking, pedestrian re-identification and the like, and is applied to the fields of artificial intelligence systems, vehicle auxiliary driving systems, intelligent robots, intelligent video monitoring, human body behavior analysis, intelligent transportation and the like. Due to the characteristics of rigidity and flexibility of the pedestrian, the appearance is easily influenced by wearing, size, shielding, posture, visual angle and the like, so that the pedestrian detection becomes a hot topic with research value and great challenge in the field of computer vision.
Small objects are a very common problem in pedestrian detection, especially in autonomous driving or surveillance scenarios: when the pedestrian target is far from the camera, detection becomes very challenging for existing algorithms. As a specific problem within general target detection, the existing CNN-based pedestrian detection methods still derive from general target detection methods (e.g., Faster R-CNN, SSD) that work by tiling candidate boxes over the image, the so-called anchor-based approach. However, the anchor-based approach suffers from three problems: first, specific anchor boxes must be manually chosen for each data set to better match pedestrian targets; second, a threshold must be manually set to define positive and negative samples; third, the training process carries a bias from the data-set annotations. Especially for hard samples and small targets, the target information a network model can learn inside a pedestrian box is very sparse, and this bias makes hard samples and small targets even more difficult for the detector.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a multi-scale-based pedestrian detection method. Aimed at the detection of weak and small targets, the invention provides a two-stage pedestrian detection method that, from a multi-scale viewpoint, introduces multi-scale feature fusion and a multi-scale receptive-field RPN. A general multi-scale model is constructed, improving the network's ability to learn the features of small-target pedestrians, improving the detection accuracy for small-target pedestrians, and reducing missed detections.
The invention adopts the following technical scheme for solving the technical problems:
the invention provides a pedestrian detection method based on multiple scales, which comprises the following steps:
step 1, obtaining a pedestrian data set, wherein the pedestrian data set comprises a CityPersons pedestrian data set and a Caltech pedestrian data set;
step 2, building a pedestrian detection model, wherein the pedestrian detection model comprises a multi-scale feature fusion model and an RPN (Region Proposal Network), specifically as follows:
(1) constructing a multi-scale feature fusion model, wherein the construction process specifically comprises the following steps:
inputting the pedestrian data set into a first partial convolutional network, wherein the first, second, third, fourth and fifth partial convolutional networks are connected in sequence,
the first partial convolution network is used for extracting a feature map fm1 of the pedestrian data set and outputting the feature map fm1 to the second partial convolution network;
the second partial convolution network is used for extracting the feature map fm2 of the pedestrian data set again and outputting the feature map fm2 to the third partial convolution network;
the third partial convolutional network is used for performing 3 convolutional layer operations of 3 × 3 and 1 maximum pooling layer processing of 2 × 2 on the feature map fm2 to obtain a feature map fm3, and inputting the feature map fm3 into the fourth partial convolutional network;
the fourth partial convolutional network is used for performing 3 convolutional layer operations of 3 × 3 and 1 maximum pooling layer processing of 2 × 2 on fm3 to obtain a feature map fm4, and inputting the feature map fm4 into the fifth partial convolutional network;
a fifth partial convolutional network, configured to perform 3 convolutional layer operations of 3 × 3 and 1 maximum pooling layer processing of 2 × 2 on fm4, to obtain a feature map fm5;
performing 1 × 1 convolution on the output fm5 obtained by the fifth part of convolution network, changing the number of channels of fm5, and recording the obtained output as M5;
performing 2 times of upsampling on the M5 to obtain upsampled M5; adding a feature map obtained by performing 1 × 1 convolution on fm4 and the up-sampled M5, and marking the obtained result as M4;
m4 is subjected to 2 times of upsampling to obtain upsampled M4; adding a feature map obtained by convolving fm3 by 1 × 1 to the up-sampled M4 to obtain a result which is recorded as M3;
recording the result obtained by performing 4-time down-sampling on the obtained M3 as M3, and recording the result obtained by performing 2-time down-sampling on the M4 as M4;
finally, the feature map obtained by adding M3, M4 and M5 is sent to the RPN network;
(2) the RPN is used for generating candidate regions through multi-scale receptive-field sliding windows; specifically:
the RPN is used for generating candidate regions by adopting 5 sliding windows with different sizes, the candidate regions are respectively realized by convolution of 1 × 1, 3 × 3, 5 × 5, 7 × 7 and 9 × 9, and finally, results formed by the obtained candidate regions with different receptive field sizes are merged;
step 3, on the basis of the constructed pedestrian detection model, inputting the pedestrian data set into the pedestrian detection model, specifically: the built pedestrian detection model is first pre-trained with the CityPersons pedestrian data set to obtain a model trained on that data set; on the basis of this model, it is fine-tuned on the Caltech pedestrian data set to obtain the trained pedestrian detection model, and pedestrian detection is then carried out with the trained pedestrian detection model.
As a further optimization scheme of the multi-scale-based pedestrian detection method, in the training of the pedestrian detection model in step 3, the pedestrian detection model is built with the PyTorch deep learning framework, the optimization function is set to the Adam algorithm, the base learning rate is set to 5e-3 (i.e., 0.005), the scales of the RPN network are quantized, and the aspect ratios of the RPN network are extended to [0.5, 0.65, 0.8, 0.95, 1.1, 1.25, 1.4, 1.55, 1.7, 1.85, 2], so that more anchor boxes are generated; the parameters of the pedestrian detection model are corrected through error back-propagation until the pedestrian detection model converges, and the parameters after convergence are saved.
As a further optimization scheme of the multi-scale-based pedestrian detection method, in step 1, the CityPersons pedestrian data set is a subset of the Cityscapes data set; the CityPersons pedestrian data set is labeled with the content of the Human category, which comprises pedestrians and riders.
As a further optimization scheme of the multi-scale-based pedestrian detection method, in step 1, a Caltech pedestrian data set is obtained by the following method: and (3) extracting the pedestrian video data set frame by frame and converting the format of the pedestrian video data set to obtain a single-frame image data set, wherein the single-frame image data set is a Caltech pedestrian data set.
In step 1, a CityPersons pedestrian data set and a Caltech pedestrian data set are stored in a VOC format.
As a further optimized solution of the multi-scale-based pedestrian detection method, the number d of channels of fm5 is 256.
As a further optimization scheme of the multi-scale-based pedestrian detection method, adding a feature map obtained by convolving fm4 by 1 × 1 to the up-sampled M5 means: the feature map obtained by 1 × 1 convolution of fm4 and the pixel value at the same position of M5 after upsampling are added, and the scale of the feature map obtained by 1 × 1 convolution of fm4 and the scale of M5 after upsampling are both 16 × 16.
As a further optimization scheme of the multi-scale-based pedestrian detection method, the addition of M3, M4 and M5 means that pixel values at the same positions of M3, M4 and M5 are added, and feature map sizes are all 16 × 16 in the same way.
Compared with the prior art, the technical scheme adopted by the invention has the following technical effects:
the method and the device improve the detection precision of the small target pedestrian, reduce missing detection, and improve the positioning precision and the robustness of the detection model.
Drawings
FIG. 1 is a schematic flow diagram of the framework of the present invention.
FIG. 2 is a structural diagram of the Faster R-CNN target detection method.
FIG. 3 is a schematic diagram of a multi-scale feature fusion module.
Fig. 4 is a diagram of the multi-scale receptive-field RPN.
FIG. 5a is a graph showing the effect of the Adapt Faster R-CNN method.
FIG. 5b shows the effect of the method of the present invention.
FIG. 5c shows the effect of the Adapt Faster R-CNN method.
FIG. 5d shows the effect of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
(1) Obtaining a pedestrian data set
The pedestrian data set employs the CityPersons pedestrian data set and the Caltech pedestrian data set. The CityPersons data set is a subset of the Cityscapes data set, and the annotation files of CityPersons label only the Human categories, i.e., person and rider. The training set comprises 19654 pedestrians in 2975 images, and the validation set 3938 pedestrians in 500 images. The Caltech pedestrian data set is originally a video data set; a single-frame image data set is obtained through frame-by-frame extraction and format conversion, and the training set comprises 122187 images. Both pedestrian data sets are stored in VOC format for convenient training.
(2) Constructing a multi-scale feature fusion model
FIG. 2 shows the network structure of Faster R-CNN, and FIG. 3 is a schematic diagram of the multi-scale feature fusion module. The invention improves on this basis.
The multi-scale feature fusion model comprises the parts of the backbone network in Faster R-CNN, namely the first, second, third, fourth and fifth partial convolutional networks, connected in sequence, wherein:
as shown in fig. 1, the first partial convolution network is used for extracting a feature map fm1 of the pedestrian data set and outputting the feature map fm1 to the second partial convolution network;
the second partial convolution network is used for extracting the feature map fm2 of the pedestrian data set again and outputting the feature map fm2 to the third partial convolution network;
the third partial convolutional network is used for performing 3 convolutional layer operations of 3 × 3 and 1 maximum pooling layer processing of 2 × 2 on the feature map fm2 to obtain a feature map fm3, and inputting the feature map fm3 to the fourth partial convolutional network;
the fourth partial convolutional network is used for performing 3 convolutional layer operations of 3 × 3 and 1 maximum pooling layer processing of 2 × 2 on fm3 to obtain a feature map fm4, and inputting the feature map fm4 into the fifth partial convolutional network;
a fifth partial convolutional network, configured to perform 3 convolutional layer operations of 3 × 3 and 1 maximum pooling layer processing of 2 × 2 on fm4, to obtain a feature map fm5;
performing 1 × 1 convolution on the output fm5 obtained by the fifth part of convolution network, changing the number of channels of fm5, and recording the obtained output as M5;
performing 2 times of upsampling on the M5 to obtain upsampled M5; adding a feature map obtained by performing 1 × 1 convolution on fm4 and the up-sampled M5, and marking the obtained result as M4;
performing 2 times of upsampling on the M4 to obtain upsampled M4; adding a feature map obtained by convolving fm3 by 1 × 1 to the up-sampled M4 to obtain a result which is recorded as M3;
recording the result obtained by performing 4-time down-sampling on the obtained M3 as M3, and recording the result obtained by performing 2-time down-sampling on the M4 as M4;
finally, adding the M3, M4 and M5 to output a characteristic diagram of the result, and sending the characteristic diagram to the RPN network;
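The fusion steps above can be sketched in PyTorch. This is a minimal illustration under assumptions, not the patented implementation: the class name is hypothetical, the channel widths follow VGG16's conv3/conv4/conv5 stages, and max-pooling is assumed for the 2× and 4× down-sampling, since the text does not specify the down-sampling operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Fuses backbone maps fm3 / fm4 / fm5 (strides 8 / 16 / 32) into one
    d-channel map at fm5's resolution, following the M5, M4, M3 steps above."""
    def __init__(self, c3, c4, c5, d=256):
        super().__init__()
        self.lat3 = nn.Conv2d(c3, d, 1)  # 1x1 convs unify the channel count to d
        self.lat4 = nn.Conv2d(c4, d, 1)
        self.lat5 = nn.Conv2d(c5, d, 1)

    def forward(self, fm3, fm4, fm5):
        m5 = self.lat5(fm5)
        m4 = self.lat4(fm4) + F.interpolate(m5, scale_factor=2, mode="nearest")
        m3 = self.lat3(fm3) + F.interpolate(m4, scale_factor=2, mode="nearest")
        # bring M3 and M4 back down to M5's resolution, then add element-wise
        m3 = F.max_pool2d(m3, kernel_size=4, stride=4)
        m4 = F.max_pool2d(m4, kernel_size=2, stride=2)
        return m3 + m4 + m5
```

Element-wise addition requires all three maps to share the same channel count and spatial size, which is exactly what the lateral 1×1 convolutions and the down-sampling provide.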
(3) Multi-scale receptive-field sliding windows generate candidate regions
In the original RPN, candidate regions are generated from the final-layer feature map output by the backbone network (VGG16, ResNet, Res2Net, etc.) using a sliding window of the fixed size 3 × 3. Considering how variable pedestrian sizes are in real scenes, owing to distance from the camera, occlusion and so on, a sliding window of a single size cannot capture all candidate regions, whereas receptive-field sliding windows of different sizes can capture pedestrian targets of different sizes. As shown in fig. 4, compared with the original detection method, the present invention generates candidate regions at the RPN stage using 5 sliding windows of different sizes, implemented by 1 × 1, 3 × 3, 5 × 5, 7 × 7 and 9 × 9 convolutions respectively; the receptive field of a pixel on the feature map produced by each sliding-window size is as follows (taking VGG16 as the backbone):
[1] 1 × 1: receptive field 196 × 196
[2] 3 × 3: receptive field 228 × 228
[3] 5 × 5: receptive field 260 × 260
[4] 7 × 7: receptive field 292 × 292
[5] 9 × 9: receptive field 324 × 324
The receptive fields [1]–[5] are calculated in a top-down manner. For ordinary convolutions the receptive field can be derived recursively: assume the receptive field at the output is of size 1; then the receptive field at each layer is linearly related to that of the layer above it. The relation depends on each layer's stride and convolution kernel size, but not on the padding; the receptive field expresses only the mapping between two layers and does not depend on the size of the original image. The formula is as follows:
F(i,j-1)=(F(i,j)-1)*stride+kernelsize
wherein F (i, j) represents the local receptive field of the ith layer to the jth layer, stride refers to the step size, and kernelsize refers to the size of the convolution kernel.
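As a check on the numbers in [1]–[5], the same recursion can be run in plain Python; walking the layers bottom-up while tracking the cumulative stride ("jump") is equivalent to the top-down formula above. The VGG16 layer list (13 convolutions of 3 × 3 and four 2 × 2 pools up to conv5_3) follows the standard configuration:

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) from input to output.
    Tracks rf (receptive field) and jump (cumulative stride)."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the RF by (k-1)*jump input pixels
        jump *= s
    return rf

# VGG16 through conv5_3: two blocks of 2 convs + pool, two blocks of 3 convs + pool,
# then 3 convs (no pool before the RPN).
VGG16 = ([(3, 1)] * 2 + [(2, 2)]) * 2 + ([(3, 1)] * 3 + [(2, 2)]) * 2 + [(3, 1)] * 3

# one k x k RPN sliding-window conv on top of conv5_3
windows = {k: receptive_field(VGG16 + [(k, 1)]) for k in (1, 3, 5, 7, 9)}
# reproduces the receptive fields 196, 228, 260, 292, 324 listed in [1]-[5]
```

Each extra unit of sliding-window size adds one cumulative stride (16 pixels at conv5_3), which is why the listed receptive fields step by 32.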
Finally, the results formed by the candidate regions of different receptive-field sizes are merged, which is equivalent to enlarging the original number of candidate regions. The receptive field is the mapping between a pixel of a convolutional layer's output feature map and the original image in a convolutional neural network; with this design, the accuracy of multi-scale pedestrian target detection is improved.
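A sketch of the multi-branch sliding-window head in PyTorch; the class name and channel widths are assumptions (the per-branch objectness and box-regression convolutions that follow in an RPN are omitted):

```python
import torch
import torch.nn as nn

class MultiScaleRPNHead(nn.Module):
    """Five parallel sliding-window convolutions (1x1 .. 9x9) over the fused map.
    Each branch sees a different receptive field; their proposals are merged."""
    def __init__(self, in_ch=256, mid_ch=512):
        super().__init__()
        # padding k//2 keeps every branch at the input's spatial size,
        # so the per-branch anchor grids stay aligned when proposals are merged
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, mid_ch, k, padding=k // 2) for k in (1, 3, 5, 7, 9)
        )

    def forward(self, x):
        return [torch.relu(b(x)) for b in self.branches]
```

Because all branches share one spatial grid, merging their candidate regions is a simple concatenation of proposal lists rather than any re-sampling step.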
(4) Loss function
The loss function used in the invention combines a classification loss and a regression loss, with the following formula:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)
The two component losses are introduced below:
1) Classification loss
The anchors generated by the RPN network are divided only into foreground and background, with foreground labeled 1 and background labeled 0. During training, 256 anchors are sampled, corresponding to N_cls in the formula; p_i is the predicted probability that anchor i is a target, and p_i* is its ground-truth (GT) label.
L_cls(p_i, p_i*) = −[p_i* log p_i + (1 − p_i*) log(1 − p_i)]
This is the log-loss over two classes (target and non-target), i.e., the classic binary cross-entropy loss.
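As an illustration, the two-class log-loss averaged over the N_cls sampled anchors can be written directly in plain Python (the function name is illustrative, and the epsilon clamp is a standard numerical guard, not part of the formula):

```python
import math

def rpn_cls_loss(p, p_star):
    """Binary cross-entropy over the sampled anchors.
    p: predicted foreground probabilities; p_star: GT labels (1 fg, 0 bg)."""
    eps = 1e-12  # guard against log(0)
    losses = [-(t * math.log(max(pi, eps)) + (1 - t) * math.log(max(1 - pi, eps)))
              for pi, t in zip(p, p_star)]
    return sum(losses) / len(losses)  # averaged by N_cls
```

A confident, correct prediction (p near its label) contributes a loss near zero, while a confident wrong one is penalized heavily by the logarithm.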
In addition, Fast R-CNN ordinarily uses the multi-class cross-entropy loss (when the number of training classes is greater than 2), but since the invention addresses the target detection problem for the pedestrian class only, a two-class loss function is used.
2) Regression loss
Here t_i = (t_x, t_y, t_w, t_h) is the vector of regression offsets predicted for a generated anchor during the RPN and Fast R-CNN training phases, and t_i* is a vector of the same dimension giving the anchor's actual offsets from the GT box.
L_reg(t_i, t_i*) = Σ_{j ∈ {x, y, w, h}} smooth_L1(t_j − t_j*)
smooth_L1(x) = 0.5 x² if |x| < 1, and |x| − 0.5 otherwise
After L_reg is computed for each anchor, it is multiplied by p_i*: as described above, p_i* is set to 1 when an object is present (positive) and 0 when no object is present (negative), meaning that only the foreground contributes to the regression loss and the background does not.
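A plain-Python sketch of the regression term with the p* mask. The smooth-L1 form is the standard Faster R-CNN choice; the default normalizer here (number of positive anchors) is a simplifying assumption, since the paper's N_reg is the number of anchor locations:

```python
def smooth_l1(x):
    """Smooth-L1: quadratic near zero, linear in the tails."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def rpn_reg_loss(t, t_star, p_star, n_reg=None):
    """Sum of smooth-L1 over (tx, ty, tw, th) per anchor; p* zeroes out
    background anchors so only foreground contributes."""
    total = 0.0
    for ti, gi, pi in zip(t, t_star, p_star):
        total += pi * sum(smooth_l1(a - b) for a, b in zip(ti, gi))
    n = n_reg if n_reg else max(sum(p_star), 1)
    return total / n
```

A small offset (|x| < 1) is penalized quadratically, keeping gradients gentle near the optimum, while large offsets grow only linearly, which makes the loss robust to outlier boxes.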
(5) Evaluation index of results
For the pedestrian detection task, the evaluation index is the MR–FPPI curve (Miss Rate versus False Positives Per Image). Basic meaning of FPPI: given a sample set of N images, each with or without detection targets, FPPI is the number of false positives averaged over the N images. Miss Rate is the fraction of positives in the data set that are missed: the number of positives judged as negatives divided by (the number of detected positives + the number of undetected positives, i.e., the number of all GT boxes). It relates to the recall rate as Miss Rate = 1 − Recall.
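The two axes of the MR–FPPI curve reduce to simple counts; a minimal sketch (the function names are illustrative):

```python
def miss_rate(tp, fn):
    """Fraction of ground-truth pedestrians missed: fn / (tp + fn) = 1 - recall."""
    return fn / (tp + fn)

def fppi(fp, num_images):
    """False positives averaged over the N images of the test set."""
    return fp / num_images
```

For example, detecting 8 of 10 GT pedestrians with 5 false alarms over 100 images gives a miss rate of 0.2 at an FPPI of 0.05; sweeping the detection-score threshold traces out the full curve.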
(1) Preparation of data sets required for the experiments
Cityscapes already provides instance-level annotations, but these annotations mark only the pixels of the visible region. Training directly with them has the following problems: 1) the proportions of the candidate regions are irregular, which affects the normalization of the regressed candidate boxes; 2) the boxes are not aligned to each pedestrian and may not be centered on the pedestrian horizontally or vertically; 3) existing data sets mark entire pedestrians, not just the visible region. It was therefore necessary to re-annotate these pedestrians, which is the origin of the CityPersons pedestrian data set. The training set comprises 19654 pedestrians in 2975 images, and the validation set 3938 pedestrians in 500 images.
(2) Training model
The pedestrian detection model is built with the PyTorch deep learning framework, and the relevant training parameters are set through configuration files: the optimization function is set to the Adam algorithm, the base learning rate is set to 5e-3, the scales of the RPN are quantized, and the default aspect ratios [0.5, 1, 2] are extended to [0.5, 0.65, 0.8, 0.95, 1.1, 1.25, 1.4, 1.55, 1.7, 1.85, 2], so that more anchor boxes are generated. The training set prepared in step (1) is fed into the built model; shared features are extracted through the backbone network combined with the multi-feature fusion network, candidate regions are then generated by the RPN module, and the result is finally sent to the prediction module. The model parameters are corrected by error back-propagation until the model converges, and the converged parameters are saved.
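The widened aspect-ratio set can be turned into concrete anchor shapes at each feature-map location. The base size and scale set below are illustrative assumptions (the text says the RPN scales are quantized but does not list them); only the ratio list comes from the configuration above:

```python
def make_anchors(base_size, scales, ratios):
    """Enumerate (w, h) anchor shapes for one feature-map location.
    ratio = h / w; each anchor preserves the area (base_size * scale)^2."""
    anchors = []
    for s in scales:
        area = (base_size * s) ** 2
        for r in ratios:
            w = (area / r) ** 0.5
            h = w * r
            anchors.append((round(w, 1), round(h, 1)))
    return anchors

ratios = [0.5, 0.65, 0.8, 0.95, 1.1, 1.25, 1.4, 1.55, 1.7, 1.85, 2]
anchors = make_anchors(16, [8, 16, 32], ratios)  # 3 scales x 11 ratios = 33 shapes
```

The finer ratio grid (steps of 0.15 instead of the default {0.5, 1, 2}) is what produces the "more anchor boxes" the text describes, at the cost of more proposals to score per location.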
(3) Results of the experiment
In order to verify the performance of the pedestrian target detection model, the invention is compared with the article "CityPersons: A Diverse Dataset for Pedestrian Detection", published at IEEE CVPR 2017 by Shanshan Zhang, Rodrigo Benenson and Bernt Schiele, whose model is abbreviated Adapt Faster R-CNN. The difference between the invention and Adapt Faster R-CNN is that the invention introduces multi-scale feature fusion and a multi-scale receptive-field RPN module. The specific experimental results are shown in Table 1.
[Table 1 is rendered as an image in the original document.]
TABLE 1
Note: MR_o represents the miss rate; the lower the value, the better the result. ΔMR represents the change in miss rate.
The results in Table 1 show that, compared with the Adapt Faster R-CNN model, the performance indices of the invention are modestly improved, indicating that the invention both improves the model's detection of small targets and reduces its miss rate. However, the visualized detection boxes in figs. 5a, 5b, 5c and 5d show that some pedestrian targets are still missed, indicating that the invention still has room for improvement in detecting small-target pedestrians.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (8)

1. A pedestrian detection method based on multi-scale is characterized by comprising the following steps:
step 1, acquiring a pedestrian data set, wherein the pedestrian data set comprises a CityPersons pedestrian data set and a Caltech pedestrian data set;
step 2, building a pedestrian detection model, wherein the pedestrian detection model comprises a multi-scale feature fusion model and an RPN (Region Proposal Network), specifically as follows:
(1) constructing a multi-scale feature fusion model, wherein the construction process specifically comprises the following steps:
inputting the pedestrian data set into a first partial convolutional network, wherein the first, second, third, fourth and fifth partial convolutional networks are connected in sequence,
the first partial convolution network is used for extracting a feature map fm1 of the pedestrian data set and outputting the feature map fm1 to the second partial convolution network;
the second partial convolution network is used for extracting the feature map fm2 of the pedestrian data set again and outputting the feature map fm2 to the third partial convolution network;
the third partial convolutional network is used for performing 3 convolutional layer operations of 3 × 3 and 1 maximum pooling layer processing of 2 × 2 on the feature map fm2 to obtain a feature map fm3, and inputting the feature map fm3 to the fourth partial convolutional network;
the fourth partial convolutional network is used for performing 3 convolutional layer operations of 3 × 3 and 1 maximum pooling layer processing of 2 × 2 on fm3 to obtain a feature map fm4, and inputting the feature map fm4 into the fifth partial convolutional network;
a fifth partial convolutional network, configured to perform 3 convolutional layer operations of 3 × 3 and 1 maximum pooling layer processing of 2 × 2 on fm4, to obtain a feature map fm5;
performing 1 × 1 convolution on the output fm5 obtained by the fifth part of convolution network, changing the number of channels of fm5, and recording the obtained output as M5;
performing 2 times of upsampling on the M5 to obtain upsampled M5; adding a feature map obtained by convolving fm4 by 1 × 1 to the up-sampled M5 to obtain a result which is recorded as M4;
m4 is subjected to 2 times of upsampling to obtain upsampled M4; adding a feature map obtained by convolving fm3 by 1 × 1 to the up-sampled M4 to obtain a result which is recorded as M3;
recording the result obtained by performing 4-time down-sampling on the obtained M3 as M3, and recording the result obtained by performing 2-time down-sampling on the M4 as M4;
finally, adding the M3, M4 and M5 to output a characteristic diagram of the result, and sending the characteristic diagram to the RPN network;
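The top-down fusion and re-alignment described above can be sketched in PyTorch as follows. This is a minimal illustration, not the patent's implementation: the lateral 1 × 1 convolutions are created ad hoc with random weights, the input channel counts other than d = 256 are assumed, and max-pooling is an assumed choice for the down-sampling step (the patent does not name the operator).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_features(fm3, fm4, fm5, d=256):
    """Top-down fusion of fm3/fm4/fm5, then re-alignment to fm5's scale.

    Shapes are illustrative, e.g. fm3: 64x64, fm4: 32x32, fm5: 16x16.
    In a real model the 1x1 convolutions would be module attributes
    with learned weights; here they are built on the fly for the sketch.
    """
    lat5 = nn.Conv2d(fm5.shape[1], d, kernel_size=1)  # channel change for M5
    lat4 = nn.Conv2d(fm4.shape[1], d, kernel_size=1)
    lat3 = nn.Conv2d(fm3.shape[1], d, kernel_size=1)

    m5 = lat5(fm5)                                      # M5 = 1x1(fm5)
    m4 = F.interpolate(m5, scale_factor=2) + lat4(fm4)  # M4 = up2(M5) + 1x1(fm4)
    m3 = F.interpolate(m4, scale_factor=2) + lat3(fm3)  # M3 = up2(M4) + 1x1(fm3)

    # re-align M3 (4x down) and M4 (2x down) to M5's scale, then add
    m3_down = F.max_pool2d(m3, kernel_size=4, stride=4)
    m4_down = F.max_pool2d(m4, kernel_size=2, stride=2)
    return m3_down + m4_down + m5
```

With the illustrative shapes above, the fused output shares fm5's 16 × 16 spatial size and has d = 256 channels, matching the claim that the three maps are added at the same scale.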
(2) the RPN network is used for generating candidate regions through multi-scale receptive-field sliding windows; the specific steps are as follows:
the RPN network generates candidate regions using 5 sliding windows of different sizes, realized respectively by convolutions of 1 × 1, 3 × 3, 5 × 5, 7 × 7 and 9 × 9; finally, the candidate-region results obtained at the different receptive-field sizes are merged;
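A sketch of the five parallel sliding windows, assuming each is a padded convolution so the branch outputs stay spatially aligned. The patent only says the per-receptive-field results are "merged"; channel-wise concatenation is one assumed reading (element-wise addition would also fit the claim), and the channel counts are placeholders.

```python
import torch
import torch.nn as nn

class MultiScaleRPNHead(nn.Module):
    """Five parallel sliding windows (1x1 ... 9x9) over the fused feature map."""

    def __init__(self, in_ch=256, mid_ch=256):
        super().__init__()
        # padding = k // 2 keeps every branch at the input's spatial size
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, mid_ch, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5, 7, 9)
        )

    def forward(self, x):
        # merge the five receptive-field responses along the channel axis
        # (an assumption; the patent does not specify the merge operator)
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```

Because all five branches preserve the spatial size, the merged tensor keeps the input's height and width while its channels grow fivefold.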
step 3, on the basis of the constructed pedestrian detection model, inputting the pedestrian data set into the pedestrian detection model; the specific scheme is as follows: pre-training the constructed pedestrian detection model with the CityPersons pedestrian data set to obtain a pre-trained pedestrian detection model; on the basis of this model, fine-tuning on the Caltech pedestrian data set to obtain the trained pedestrian detection model; and performing pedestrian detection with the trained pedestrian detection model.
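The two-stage scheme (pre-train on CityPersons, then fine-tune on Caltech) can be sketched with one generic training loop reused for both stages. The loop, loss, loader names and checkpoint path are placeholders, not taken from the patent; only the Adam optimizer and the 5e-3 base learning rate come from claim 2.

```python
import torch

def train(model, loader, epochs, lr):
    """One generic supervised training loop, reused for both stages."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # Adam per claim 2
    criterion = torch.nn.CrossEntropyLoss()                  # placeholder loss
    for _ in range(epochs):
        for images, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
    return model

# Stage 1: pre-train on CityPersons (loader is a placeholder):
#   model = train(model, citypersons_loader, epochs=..., lr=5e-3)
#   torch.save(model.state_dict(), "citypersons_pretrained.pth")
# Stage 2: load the pre-trained weights and fine-tune on Caltech:
#   model.load_state_dict(torch.load("citypersons_pretrained.pth"))
#   model = train(model, caltech_loader, epochs=..., lr=...)
```

The fine-tuning epochs and learning rate are left unspecified here because the patent does not state them.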
2. The multi-scale-based pedestrian detection method according to claim 1, wherein in the training of the pedestrian detection model in step 3, the pedestrian detection model is built using the PyTorch deep learning framework, the optimization function is set to the Adam algorithm, and the base learning rate is set to 5e-3 (scientific notation, i.e. 0.005); the anchor scales of the RPN network are refined, and the aspect ratios of the RPN network are set to [0.5, 0.65, 0.8, 0.95, 1.1, 1.25, 1.4, 1.55, 1.7, 1.85, 2], so that more anchor points are generated; the parameters of the pedestrian detection model are corrected by error back-propagation until the model converges, and the converged parameters are stored.
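The denser aspect-ratio list in claim 2 can be turned into anchor widths and heights as sketched below. The area-preserving parameterization (ratio read as height / width, so ratios above 1 give tall, pedestrian-shaped boxes) is the common Faster R-CNN convention and is an assumption here; the patent does not spell out the parameterization.

```python
import math

# aspect ratios taken verbatim from claim 2
ASPECT_RATIOS = [0.5, 0.65, 0.8, 0.95, 1.1, 1.25, 1.4, 1.55, 1.7, 1.85, 2]

def anchor_sizes(scale, ratios=ASPECT_RATIOS):
    """Return one (w, h) pair per aspect ratio, keeping area = scale**2.

    ratio is interpreted as h / w (assumed convention), so each anchor
    has w = scale / sqrt(ratio) and h = scale * sqrt(ratio).
    """
    sizes = []
    for r in ratios:
        w = scale / math.sqrt(r)
        h = scale * math.sqrt(r)
        sizes.append((w, h))
    return sizes
```

For a single scale of 16 this yields 11 anchor shapes per location instead of the usual 3, which is how the finer ratio grid "generates more anchor points".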
3. The method as claimed in claim 1, wherein in step 1, the CityPersons data set is a subset of the Cityscapes data set, and the CityPersons data set is annotated with human-like categories, which include pedestrians and riders.
4. The multi-scale-based pedestrian detection method according to claim 1, wherein in step 1, the Caltech pedestrian data set is obtained by: extracting the pedestrian video data set frame by frame and converting its format to obtain a single-frame image data set, which constitutes the Caltech pedestrian data set.
5. The multi-scale-based pedestrian detection method according to claim 1, wherein in step 1, the CityPersons pedestrian data set and the Caltech pedestrian data set are stored in a VOC format.
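Storing annotations "in VOC format" (claim 5) means one PASCAL VOC XML file per image. A minimal standard-library sketch of building such a file is shown below; the exact field subset written here follows the usual VOC layout and is an assumption, since the patent only names the format.

```python
import xml.etree.ElementTree as ET

def voc_annotation(filename, width, height, boxes):
    """Build a PASCAL VOC XML annotation string.

    boxes: list of (name, xmin, ymin, xmax, ymax) tuples,
    e.g. [("pedestrian", 10, 20, 110, 220)].
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"  # RGB assumed
    for name, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        bndbox = ET.SubElement(obj, "bndbox")
        ET.SubElement(bndbox, "xmin").text = str(xmin)
        ET.SubElement(bndbox, "ymin").text = str(ymin)
        ET.SubElement(bndbox, "xmax").text = str(xmax)
        ET.SubElement(bndbox, "ymax").text = str(ymax)
    return ET.tostring(root, encoding="unicode")
```

One such XML file would be written alongside each CityPersons or Caltech frame so that a standard VOC-style data loader can consume both data sets uniformly.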
6. The multi-scale-based pedestrian detection method of claim 1, wherein the number d of channels of fm5 is 256.
7. The method for detecting pedestrians according to claim 1, wherein adding the feature map obtained by 1 × 1 convolution of fm4 to the up-sampled M5 means: the pixel values at the same positions of the two maps are added element-wise; the feature map obtained by 1 × 1 convolution of fm4 and the up-sampled M5 have the same scale of 16 × 16.
8. The method for detecting pedestrians according to claim 1, wherein adding M3, M4 and M5 means that the pixel values at the same positions of M3, M4 and M5 are added element-wise; the three feature maps are all of size 16 × 16.
CN202110419108.4A 2021-04-19 2021-04-19 Multi-scale-based pedestrian detection method Active CN113269038B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110419108.4A CN113269038B (en) 2021-04-19 2021-04-19 Multi-scale-based pedestrian detection method


Publications (2)

Publication Number Publication Date
CN113269038A CN113269038A (en) 2021-08-17
CN113269038B true CN113269038B (en) 2022-07-15

Family

ID=77229006


Country Status (1)

Country Link
CN (1) CN113269038B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113947144B (en) 2021-10-15 2022-05-17 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for object detection

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570564A (en) * 2016-11-03 2017-04-19 天津大学 Multi-scale pedestrian detection method based on depth network
CN109284670A (en) * 2018-08-01 2019-01-29 清华大学 A kind of pedestrian detection method and device based on multiple dimensioned attention mechanism
CN110490174A (en) * 2019-08-27 2019-11-22 电子科技大学 Multiple dimensioned pedestrian detection method based on Fusion Features
CN111695430A (en) * 2020-05-18 2020-09-22 电子科技大学 Multi-scale face detection method based on feature fusion and visual receptive field network




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant