CN111259736B - Real-time pedestrian detection method based on deep learning in complex environment

Real-time pedestrian detection method based on deep learning in complex environment

Info

Publication number
CN111259736B
CN111259736B (application CN202010018507.5A)
Authority
CN
China
Prior art keywords
detection
pedestrian
target
layer
data set
Prior art date
Legal status
Active
Application number
CN202010018507.5A
Other languages
Chinese (zh)
Other versions
CN111259736A (en)
Inventor
孙丽华
周薇娜
Current Assignee
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date
Filing date
Publication date
Application filed by Shanghai Maritime University filed Critical Shanghai Maritime University
Priority to CN202010018507.5A
Publication of CN111259736A
Application granted
Publication of CN111259736B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention provides a real-time pedestrian detection method in complex environments based on deep learning, comprising the following steps: S1, establishing a detection model based on the YOLO algorithm; S2, selecting a plurality of pedestrian images captured in complex environments from a color-thermal image library, establishing a training data set and a test data set, inputting the training data set into the detection model, and training the detection model; S3, inputting the test data set into the trained detection model, outputting detection results for the pedestrian targets in the infrared thermal images and RGB color images of the test data set, and screening the detection results with the non-maximum suppression method; and S4, comparing the detection results with those of the YOLOv3 and YOLO-tiny detection algorithms to verify detection accuracy and detection speed.

Description

Real-time pedestrian detection method based on deep learning in complex environment
Technical Field
The invention belongs to the field of target recognition, and particularly relates to a real-time pedestrian detection method in complex environments based on deep learning.
Background
Pedestrian detection has received much attention from the computer vision community owing to its wide application in driving assistance (autonomous vehicles), robotics, person re-identification, video surveillance, pedestrian behavior analysis, and related fields. Compared with traditional methods, deep learning techniques currently achieve good results in pedestrian detection. However, in some complex natural environments, detection that relies only on visible-spectrum or infrared images is not accurate enough. Pedestrian targets present many challenges in complex environments, such as smoke, rain, dust, and dim lighting. RGB color images carry rich spectral information and can capture scene detail under adequate illumination, but targets are hard to detect in them when visibility is poor. An infrared thermal image is a thermal radiation image whose gray level is determined by the temperature difference between the observed target and the background; it generally lacks structural information, so the target easily blends into the background, causing false detections and missed detections. These limitations seriously affect the reliability and practicality of pedestrian detection systems, so real-time pedestrian detection in complex environments is a research topic of great practical significance.
Disclosure of Invention
The invention aims to provide a real-time pedestrian detection method based on deep learning in complex environments, which can quickly and accurately detect pedestrian targets in infrared thermal images and RGB (red, green, blue) color images captured in complex environments, improves the recognition of small targets occupying few pixels, and ensures both detection accuracy and detection speed.
In order to achieve the above object, the present invention provides a method for detecting pedestrians in real time under a complex environment based on deep learning, comprising the steps of:
S1, establishing a detection model based on the YOLO algorithm; the detection model specifically comprises, connected in sequence: five ResNet-based convolutional network layers, three SPP-based maximum pooling layers, and three target detection layers;
S2, selecting a plurality of pedestrian images in complex environments from a color-thermal image library, and resizing the pedestrian images to a predetermined size; selecting a part of the pedestrian images as a training data set and using the rest as a test data set; inputting the training data set into the detection model, and training the detection model;
S3, inputting the test data set into the trained detection model, outputting the detection results for the pedestrian targets in the test data set, and screening the detection results;
and S4, comparing the detection results with those of the YOLOv3 and YOLO-tiny detection algorithms to verify detection accuracy and detection speed.
In the detection model in step S1, each maximum pooling layer includes a filter, and the filter sizes of the three maximum pooling layers are 5 × 5, 9 × 9, and 13 × 13 pixels, respectively.
In the detection model of step S1, the detection scales of the three target detection layers are 13 × 13, 26 × 26, and 104 × 104 pixels respectively; three corresponding anchor boxes of different sizes are generated for each target detection layer, according to its detection scale, by the K-means clustering algorithm.
The predetermined size in step S2 is 416 × 416 pixels.
Step S3 specifically includes:
S31, extracting pedestrian target features in the test data set through the five ResNet-based convolutional network layers;
S32, further extracting pedestrian target features in the test data set through the three SPP-based maximum pooling layers;
S33, predicting bounding box coordinates, a target confidence score, and a pedestrian class probability for each pedestrian target in the test data set from the extracted pedestrian target features, using the three target detection layers and a multi-scale prediction strategy;
and S34, screening the bounding box coordinates, target confidence scores, and pedestrian class probabilities with the non-maximum suppression method to obtain the detection results for the pedestrian targets.
Compared with the prior art, the invention has the advantages that:
1) The method can detect pedestrian targets in RGB color images and infrared thermal images collected in complex environments such as smoke, rain, dust, and dim lighting, and is robust to the interference such environments introduce into the images;
2) The detection model reduces the number of network layers used for feature extraction: pedestrian features are extracted by five ResNet-based convolutional network layers and three SPP-based maximum pooling layers, which effectively capture target features in complex environments, preserve detection accuracy, and markedly increase the rate of feature extraction;
3) The five ResNet-based convolutional network layers effectively control gradient propagation, avoiding the vanishing or exploding gradients that hamper training of the detection model;
4) The invention performs multi-scale detection by equipping the three target detection layers with different detection scales and anchor boxes of different sizes, greatly improving the detection of small targets smaller than 80 × 40 pixels.
Drawings
To illustrate the technical solution of the present invention more clearly, the drawings used in the description are briefly introduced below. The drawings represent one embodiment of the invention; those skilled in the art can derive other drawings from them without creative effort:
FIG. 1 is a schematic view of a detection model according to the present invention;
FIG. 2 is a schematic diagram of multi-scale prediction performed by three target detection layers according to the present invention;
FIG. 3 is a schematic diagram of a five-layer convolutional network layer structure based on ResNet according to the present invention;
FIG. 4 is a flowchart of a pedestrian real-time detection method based on deep learning in a complex environment.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the possible embodiments of the invention. All other embodiments that a person skilled in the art can derive from them without creative effort fall within the protection scope of the invention.
The invention provides a real-time pedestrian detection method based on deep learning in complex environments, used to detect pedestrian targets in images of such environments. In the embodiment of the invention, the hardware is a server with an Intel i7-8700K processor, an NVIDIA TITAN Xp graphics card, and 64 GB of RAM; the software environment is Ubuntu 16.04 with the Darknet framework.
As shown in fig. 4, the method for detecting pedestrians in real time in complex environment based on deep learning of the present invention includes the steps of:
s1, establishing a detection model based on a YOLO (You can Only live Once) algorithm; as shown in fig. 1, the detection model specifically includes the following components connected in sequence: a five-layer convolutional network layer based on ResNet (deep residual error network), a three-layer maximum Pooling layer based on SPP (Spatial Pyramid Pooling), and three-layer target detection layer;
the five-layer convolution network layer based on the ResNet and the three maximum pooling layers jointly form a feature extraction network of the detection model, and the feature extraction network is used for extracting pedestrian target features of RGB color images and infrared thermal images.
Fig. 3 is a schematic diagram of the five-layer ResNet-based convolutional network structure of the present invention, in which Conv denotes a convolutional layer, Max a maximum pooling layer, and Res a residual layer; Filter is the number of filters, Size the filter size, and Output the output feature-map size in pixels. As shown in fig. 1, the feature maps output by the five convolutional network layers are 208 × 208, 104 × 104, 52 × 52, 26 × 26, and 13 × 13 pixels in turn.
For a conventional deep-learning network model, a deeper network can in principle learn more, at the cost of slower convergence and longer training time. Once the depth passes a certain point, however, the rate of learning drops; in some settings classification accuracy even decreases as more layers are added, and vanishing or exploding gradients readily occur. In other words, if the network is too deep, the detection model becomes insensitive and its final classification performance suffers. ResNet introduces the residual block, which resolves the slow learning and stagnating accuracy caused by increased depth, so good performance can be maintained while training the network model.
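As a minimal sketch of the idea only — written here in PyTorch, which the patent does not use (the embodiment runs on Darknet), with illustrative channel counts not taken from the patent — one downsampling stage of such a ResNet-based feature extractor could look like this:

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Residual unit: two convolutions plus an identity shortcut, so the
        gradient has a direct path and is less prone to vanishing/exploding."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels // 2, 1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels // 2)
            self.conv2 = nn.Conv2d(channels // 2, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.act = nn.LeakyReLU(0.1)

        def forward(self, x):
            out = self.act(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.act(out + x)  # identity shortcut

    class Stage(nn.Module):
        """One of five stages: a stride-2 convolution halves the feature map
        (416 -> 208 -> 104 -> 52 -> 26 -> 13), then a residual block refines it."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.down = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.1),
            )
            self.res = ResidualBlock(out_ch)

        def forward(self, x):
            return self.res(self.down(x))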
As shown in fig. 1, in the three SPP-based maximum pooling layers of the detection model, each maximum pooling layer contains one filter, and the filter sizes of the three layers are 5 × 5, 9 × 9, and 13 × 13 pixels respectively. These three pooling layers process the feature maps of the RGB color images and infrared thermal images from fine to coarse scales and aggregate the local features at every scale, so that further pedestrian target features are captured and the detection accuracy for pedestrian targets improves.
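A minimal sketch of such an SPP block, following the common YOLOv3-SPP arrangement (stride-1 pooling with same-padding and channel concatenation — an assumption, since the patent specifies only the 5 × 5, 9 × 9, and 13 × 13 filter sizes):

    import torch
    import torch.nn as nn

    class SPPBlock(nn.Module):
        """Three parallel max-pooling layers (5x5, 9x9, 13x13) applied to the
        same feature map; stride 1 with same-padding keeps the spatial size,
        and the pooled maps are concatenated with the input channel-wise."""
        def __init__(self):
            super().__init__()
            self.pools = nn.ModuleList(
                nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13)
            )

        def forward(self, x):
            # Fine-to-coarse local aggregation fused along the channel axis.
            return torch.cat([x] + [p(x) for p in self.pools], dim=1)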
The three target detection layers are used for detecting the pedestrian target from the RGB color image and the infrared thermal image according to the pedestrian target characteristics.
As shown in fig. 1, the detection scales of the three target detection layers are 13 × 13, 26 × 26, and 104 × 104 respectively; three corresponding anchor boxes of different sizes are generated for each target detection layer, according to its detection scale, by the K-means clustering algorithm. The dimensions of the anchor boxes of each target detection layer are shown in Table 1:
TABLE 1 width and height gauge of anchor box
[Table 1 appears only as an image in the original document. It lists the width and height, in pixels, of the three anchor boxes assigned to each of the three target detection layers; the first layer's anchors are quoted in the text below as (18.9479, 11.9025), (44.8887, 22.4906), and (88.8151, 27.1594).]
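For illustration, the standard YOLO-style K-means recipe for such anchors — clustering labeled-box (width, height) pairs under a 1 − IoU distance — could be sketched as follows; this is a conventional reconstruction, not code from the patent:

    import numpy as np

    def kmeans_anchors(wh, k=9, iters=100, seed=0):
        """Cluster (width, height) pairs of labeled boxes with K-means under a
        1 - IoU distance (boxes compared as if sharing one center).
        wh: float array of shape (N, 2). Returns k anchors sorted by area."""
        rng = np.random.default_rng(seed)
        centers = wh[rng.choice(len(wh), size=k, replace=False)]
        for _ in range(iters):
            inter = (np.minimum(wh[:, None, 0], centers[None, :, 0]) *
                     np.minimum(wh[:, None, 1], centers[None, :, 1]))
            union = wh[:, 0:1] * wh[:, 1:2] + centers[:, 0] * centers[:, 1] - inter
            assign = (1.0 - inter / union).argmin(axis=1)   # nearest center
            new = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i)
                            else centers[i] for i in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        return centers[np.argsort(centers[:, 0] * centers[:, 1])]

The nine resulting anchors would then be split three per detection layer, smallest anchors to the finest scale and largest to the coarsest 13 × 13 scale.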
S2, selecting a plurality of pedestrian images in complex environments from a color-thermal image library, and resizing the pedestrian images to a predetermined size; selecting a part of the pedestrian images as a training data set and using the rest as a test data set; inputting the training data set into the detection model, and training the detection model;
in an embodiment of the present invention, 1000 images of pedestrians in a complex environment are selected from the OSU color thermal database of OTCBVS. These pedestrian images are fused with an infrared thermal image and an RGB color image. 700 of 1000 pedestrian images are selected as a training data set, the other 300 pedestrian images are selected as a testing data set, and pedestrian labels in all the pedestrian images are manually marked by using a labelImg (image marking tool).
In the embodiment of the present invention, each pedestrian image is first resized to 416 × 416 pixels. The detection model is trained by stochastic gradient descent. During training, the initial learning rate is set to 0.001, momentum to 0.9, weight decay to 0.005, and batch size to 16; the training data are iterated 12000 times at a learning rate of 0.001, then 6000 more times at 0.0001, and finally 3000 more times at 0.00001. The IoU (Intersection over Union) threshold separating positive and negative samples is set to 0.5.
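The piecewise learning-rate schedule quoted above can be made concrete with a short sketch; the optimizer values (momentum 0.9, weight decay 0.005, batch size 16) come from this paragraph, while the surrounding training-loop code is a hypothetical PyTorch stand-in for the Darknet configuration actually used:

    import torch

    def lr_at(iteration):
        """Piecewise-constant schedule from the embodiment: 0.001 for the
        first 12000 iterations, 0.0001 for the next 6000, and 0.00001 for
        the final 3000 (21000 iterations in total)."""
        if iteration < 12000:
            return 1e-3
        if iteration < 18000:
            return 1e-4
        return 1e-5

    model = torch.nn.Conv2d(3, 16, 3)   # placeholder for the detection model
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_at(0),
                                momentum=0.9, weight_decay=0.005)

    for it in range(21000):
        for group in optimizer.param_groups:
            group["lr"] = lr_at(it)
        # ... forward pass on a batch of 16 images, compute the YOLO loss,
        # loss.backward(), optimizer.step(), optimizer.zero_grad() ...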
S3, inputting the test data set into the trained detection model, outputting the detection results for the pedestrian targets in the test data set, and screening the detection results with the non-maximum suppression method;
step S3 specifically includes:
S31, extracting pedestrian target features in the test data set through the five ResNet-based convolutional network layers;
S32, further extracting pedestrian target features in the test data set through the three SPP-based maximum pooling layers;
and S33, predicting bounding box coordinates, a target confidence score, and a pedestrian class probability for each target in the test data set from the extracted pedestrian target features, using the three target detection layers and a multi-scale prediction strategy.
As shown in fig. 2, step S33 specifically comprises: taking the output of the three SPP-based maximum pooling layers as the input of the three target detection layers. The three layers — a first, a second, and a third target detection layer — detect at the three scales of 13 × 13, 26 × 26, and 104 × 104 pixels respectively; 2× upsampling transfers feature-map scale between the first and second target detection layers, and 4× upsampling transfers it between the second and third. Through these two upsampling steps, the features extracted at the different target detection layers are combined. The detection scale of the first target detection layer is 13 × 13 pixels, i.e., the first layer generates a 13 × 13 grid of cells. In each target detection layer, every cell is predicted by its 3 corresponding anchor boxes, yielding 3 bounding boxes. In the embodiment of the present invention, as indicated by Table 1, the cells of the first target detection layer are predicted by three anchor boxes whose widths and heights (in pixels) are (18.9479, 11.9025), (44.8887, 22.4906), and (88.8151, 27.1594). Each bounding box predicts four bounding box coordinate values, so each bounding box outputs (1 + 4 + C) values: 4 box coordinates, 1 target confidence score, and C class probabilities, where C is the number of predicted target categories. The pedestrian target class of the invention is one of these C categories;
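The conventional YOLOv3 decoding of these (1 + 4 + C) values per anchor can be sketched as follows; the sigmoid/exponential transforms are the standard YOLOv3 convention and are assumed here, since the patent states only the output structure:

    import numpy as np

    def decode_layer(raw, anchors, stride, num_classes=1):
        """Split one detection layer's raw output of shape (B, A*(5+C), S, S)
        into box centers/sizes (in pixels), objectness, and class scores.
        anchors: float array (A, 2) of anchor (width, height) in pixels."""
        sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
        B, _, S, _ = raw.shape
        A = len(anchors)
        p = raw.reshape(B, A, 5 + num_classes, S, S)
        gy, gx = np.mgrid[0:S, 0:S]                          # cell indices
        cx = (sigmoid(p[:, :, 0]) + gx) * stride             # box center x
        cy = (sigmoid(p[:, :, 1]) + gy) * stride             # box center y
        w = np.exp(p[:, :, 2]) * anchors[:, 0, None, None]   # box width
        h = np.exp(p[:, :, 3]) * anchors[:, 1, None, None]   # box height
        obj = sigmoid(p[:, :, 4])                            # confidence score
        cls = sigmoid(p[:, :, 5:])                           # class probabilities
        return cx, cy, w, h, obj, cls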
and S34, screening the bounding box coordinates, target confidence scores, and pedestrian class probabilities with the non-maximum suppression method to obtain the detection results for the pedestrian targets. Step S34 specifically comprises:
S341, identifying the pedestrian target according to the pedestrian class probability;
S342, screening the detection results with the non-maximum suppression method;
in step S33, when detecting a pedestrian target, a plurality of detection result candidate frames are generated on an image for one pedestrian target, where one candidate frame is the predicted boundary frame, and each boundary frame corresponds to one target confidence score; and sorting the target confidence degree scores corresponding to the same pedestrian target, selecting the candidate frame with the highest target confidence degree score as a standard frame, and using other candidate frames as reference frames. Then, the overlapping degree (iou) of the reference frame and the standard frame is calculated, if the overlapping degree is larger than a set threshold value, the reference frame is deleted, and the standard frame is used as a detection result of the pedestrian target. Since there may be several high scoring boxes for the same detected pedestrian target, we need only the one with the best results.
And S4, comparing the detection results with those of the YOLOv3 and YOLO-tiny detection algorithms to verify detection accuracy and detection speed.
Table 2 compares the detection accuracy and detection speed of YOLOv3, YOLO-tiny, and Ours (the detection method of the present invention). Detection accuracy is reported as mAP (mean Average Precision); FPS (frames per second) is the number of frames detected per second and represents detection speed.
TABLE 2 comparison of detection algorithms
[Table 2 appears only as an image in the original document; it reports the mAP and FPS of YOLOv3, YOLO-tiny, and the proposed method.]
The mAP is calculated according to formulas (1) and (2):

AP = \frac{1}{R} \sum_{j=1}^{N} \frac{R_j}{j} \, I_j \qquad (1)

mAP = \frac{1}{Q_R} \sum_{q=1}^{Q_R} AP(q) \qquad (2)
where AP (Average Precision) represents the detection precision of the pedestrian category; R is the number of pedestrian-category objects in the test data set; N is the number of targets of all categories in the test data set; j is the rank of a target, with the indicator I_j equal to 1 if the j-th target matches a ground truth and 0 otherwise; R_j is the number of relevant objects among the first j targets; q denotes a category, and Q_R the total number of categories. The mAP is the mean of the per-category AP values and lies between 0 and 1; because the invention has only the single pedestrian category, its mAP equals its AP, and a larger value means higher detection precision.
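Formula (1) can be made concrete with a short sketch; the input is the ranked indicator sequence I_j, and taking R as the total number of relevant items in the ranking is a simplifying assumption (properly, R is the ground-truth pedestrian count, matched or not):

    import numpy as np

    def average_precision(indicators, R=None):
        """AP per formula (1): indicators is the sequence I_1..I_N over the N
        detections ranked by confidence; R defaults to the number of relevant
        items (a simplification -- properly the ground-truth count)."""
        I = np.asarray(indicators, dtype=float)
        R = I.sum() if R is None else float(R)
        if R == 0:
            return 0.0
        j = np.arange(1, len(I) + 1)
        R_j = np.cumsum(I)                 # relevant objects among the first j
        return float((I * R_j / j).sum() / R)

    # Formula (2): mean over the Q_R categories. With the single pedestrian
    # category, mAP equals AP.
    print(average_precision([1, 1, 0, 1, 0]))   # ~0.9167 for this toy ranking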
As Table 2 shows, YOLO-tiny detects quickly but with low accuracy, limiting its practicality. The detection accuracy of the proposed algorithm is far higher than YOLO-tiny's; although slightly below YOLOv3's, its detection speed is markedly higher than YOLOv3's. The proposed method therefore offers the best overall performance.
The real-time pedestrian detection method based on deep learning in complex environments can be used in real-time monitoring scenarios built on RGB color images or infrared thermal images. The invention adds three SPP-based maximum pooling layers after the five convolutional network layers to extract further pedestrian target features from the images, and the multi-scale detection strategy of the three target detection layers greatly improves the detection of small targets smaller than 80 × 40 pixels.
In the prior art, the Darknet-53 network used by the YOLOv3 detection algorithm is overly complex and redundant for pedestrian detection: its large parameter count makes training difficult, demands a large training data set, and lowers detection speed. YOLOv3-tiny, in turn, buys its speed by sacrificing detection accuracy. The experimental results show that the proposed real-time pedestrian detection method offers better overall detection performance, more accurate results, and stronger real-time behavior, addressing both the excessive parameter count of YOLOv3 and the low recall of YOLO-tiny.
While the invention has been described with reference to specific embodiments, it is not limited to them, and those skilled in the art can readily make various equivalent modifications and substitutions within its technical scope. The protection scope of the present invention is therefore defined by the claims.

Claims (5)

1. A real-time pedestrian detection method in complex environments based on deep learning, characterized by comprising the following steps:
s1, establishing a detection model based on a YOLO algorithm; the detection model specifically comprises the following components connected in sequence: a five-layer convolution network layer based on ResNet, a three-layer maximum pooling layer based on SPP, and a three-layer target detection layer;
in the detection model of step S1, the detection scales of the three target detection layers are 13 × 13, 26 × 26, and 104 × 104 pixels respectively; three corresponding anchor boxes of different sizes are generated for each target detection layer, according to its detection scale, by the K-means clustering algorithm;
S2, selecting a plurality of pedestrian images in complex environments from a color-thermal image library, and resizing the pedestrian images to a predetermined size; selecting a part of the pedestrian images as a training data set and using the rest as a test data set; inputting the training data set into the detection model, and training the detection model;
S3, inputting the test data set into the trained detection model, outputting the detection results for the pedestrian targets in the test data set, and screening the detection results; step S3 includes:
S31, extracting pedestrian target features in the test data set through the five ResNet-based convolutional network layers;
S32, further extracting pedestrian target features in the test data set through the three SPP-based maximum pooling layers;
S33, predicting bounding box coordinates, a target confidence score, and a pedestrian class probability for each pedestrian target in the test data set from the extracted pedestrian target features, using the three target detection layers and a multi-scale prediction strategy;
S34, screening the bounding box coordinates, target confidence scores, and pedestrian class probabilities with the non-maximum suppression method to obtain the detection results for the pedestrian targets;
and S4, comparing the detection results with those of the YOLOv3 and YOLO-tiny detection algorithms to verify detection accuracy and detection speed.
2. The method for detecting pedestrians in a complex environment based on deep learning of claim 1, wherein in the detection model of step S1, each maximum pooling layer includes a filter, and the filter sizes of the three maximum pooling layers are 5 × 5, 9 × 9, and 13 × 13 pixels, respectively.
3. The method for detecting the pedestrian in the complex environment based on the deep learning as claimed in claim 1, wherein the predetermined size in step S2 is 416 × 416 pixels.
4. The method for detecting the pedestrian in the complex environment based on the deep learning as claimed in claim 1, wherein the step S2 is to train the detection model by a stochastic gradient descent method.
5. The method for detecting pedestrians in complex environments based on deep learning of claim 1, wherein each pedestrian image in step S2 is a fusion of an infrared thermal image and an RGB color image.
CN202010018507.5A 2020-01-08 2020-01-08 Real-time pedestrian detection method based on deep learning in complex environment Active CN111259736B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010018507.5A CN111259736B (en) 2020-01-08 2020-01-08 Real-time pedestrian detection method based on deep learning in complex environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010018507.5A CN111259736B (en) 2020-01-08 2020-01-08 Real-time pedestrian detection method based on deep learning in complex environment

Publications (2)

Publication Number Publication Date
CN111259736A CN111259736A (en) 2020-06-09
CN111259736B 2023-04-07

Family

ID=70953885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010018507.5A Active CN111259736B (en) 2020-01-08 2020-01-08 Real-time pedestrian detection method based on deep learning in complex environment

Country Status (1)

Country Link
CN (1) CN111259736B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709381A (en) * 2020-06-19 2020-09-25 桂林电子科技大学 Road environment target detection method based on YOLOv3-SPP
CN112836619A (en) * 2021-01-28 2021-05-25 合肥英睿系统技术有限公司 Embedded vehicle-mounted far infrared pedestrian detection method, system, equipment and storage medium
CN113361405B (en) * 2021-06-07 2022-10-21 浪潮软件科技有限公司 Asian image recognition method and system based on yolo v3

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018215861A1 (en) * 2017-05-24 2018-11-29 Kpit Technologies Limited System and method for pedestrian detection
CN109325418A (en) * 2018-08-23 2019-02-12 华南理工大学 Based on pedestrian recognition method under the road traffic environment for improving YOLOv3
CN109508675A (en) * 2018-11-14 2019-03-22 广州广电银通金融电子科技有限公司 A kind of pedestrian detection method for complex scene
CN109670405A (en) * 2018-11-23 2019-04-23 华南理工大学 A kind of complex background pedestrian detection method based on deep learning
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
WO2019136946A1 (en) * 2018-01-15 2019-07-18 中山大学 Deep learning-based weakly supervised salient object detection method and system
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN110472542A (en) * 2019-08-05 2019-11-19 深圳北斗通信科技有限公司 A kind of infrared image pedestrian detection method and detection system based on deep learning
CN110490174A (en) * 2019-08-27 2019-11-22 电子科技大学 Multiple dimensioned pedestrian detection method based on Fusion Features

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416327B (en) * 2018-03-28 2022-04-29 京东方科技集团股份有限公司 Target detection method and device, computer equipment and readable storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018215861A1 (en) * 2017-05-24 2018-11-29 Kpit Technologies Limited System and method for pedestrian detection
WO2019136946A1 (en) * 2018-01-15 2019-07-18 中山大学 Deep learning-based weakly supervised salient object detection method and system
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109325418A (en) * 2018-08-23 2019-02-12 华南理工大学 Based on pedestrian recognition method under the road traffic environment for improving YOLOv3
CN109508675A (en) * 2018-11-14 2019-03-22 广州广电银通金融电子科技有限公司 A kind of pedestrian detection method for complex scene
CN109670405A (en) * 2018-11-23 2019-04-23 华南理工大学 A kind of complex background pedestrian detection method based on deep learning
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
CN110472542A (en) * 2019-08-05 2019-11-19 深圳北斗通信科技有限公司 A kind of infrared image pedestrian detection method and detection system based on deep learning
CN110490174A (en) * 2019-08-27 2019-11-22 电子科技大学 Multiple dimensioned pedestrian detection method based on Fusion Features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on fast detection algorithms for small ground targets; Ouyang Lecheng et al.; Signal Processing (《信号处理》); 2019-12-25 (No. 12); full text *

Also Published As

Publication number Publication date
CN111259736A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN106897670B (en) Express violence sorting identification method based on computer vision
CN111259736B (en) Real-time pedestrian detection method based on deep learning in complex environment
CN110929607B (en) Remote sensing identification method and system for urban building construction progress
CN112101221B (en) Method for real-time detection and identification of traffic signal lamp
CN111709310B (en) Gesture tracking and recognition method based on deep learning
CN112001339A (en) Pedestrian social distance real-time monitoring method based on YOLO v4
CN108038846A (en) Transmission line equipment image defect detection method and system based on multilayer convolutional neural networks
CN104063719A (en) Method and device for pedestrian detection based on depth convolutional network
CN111553201A (en) Traffic light detection method based on YOLOv3 optimization algorithm
CN112270331A (en) Improved billboard detection method based on YOLOV5
CN107862702A (en) A kind of conspicuousness detection method of combination boundary connected and local contrast
CN110569843A (en) Intelligent detection and identification method for mine target
CN111008994A (en) Moving target real-time detection and tracking system and method based on MPSoC
CN109086803A (en) A kind of haze visibility detection system and method based on deep learning and the personalized factor
CN115690564A (en) Outdoor fire smoke image detection method based on Recursive BIFPN network
CN109919246A (en) Pedestrian's recognition methods again based on self-adaptive features cluster and multiple risks fusion
CN114973207A (en) Road sign identification method based on target detection
CN114049572A (en) Detection method for identifying small target
CN115690545B (en) Method and device for training target tracking model and target tracking
CN117197676A (en) Target detection and identification method based on feature fusion
CN113221956A (en) Target identification method and device based on improved multi-scale depth model
CN112241693A (en) Illegal welding fire image identification method based on YOLOv3
CN110008834B (en) Steering wheel intervention detection and statistics method based on vision
CN114067273A (en) Night airport terminal thermal imaging remarkable human body segmentation detection method
CN113239725A (en) Method and system for identifying pedestrians waiting for crossing and crossing direction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant