CN113033478A

CN113033478A - Pedestrian detection method based on deep learning

Info

Publication number: CN113033478A
Application number: CN202110420061.3A
Authority: CN
Inventors: 卢立晖; 索婕; 王化建; 张立华; 司鹏程; 丁明亮; 李磊; 张正强
Original assignee: Rizhao Huilian Zhongchuang Intelligent Technology Research Institute; Qufu Normal University
Current assignee: Rizhao Huilian Zhongchuang Intelligent Technology Research Institute; Qufu Normal University
Priority date: 2021-04-19
Filing date: 2021-04-19
Publication date: 2021-06-25

Abstract

The invention discloses a pedestrian detection method based on deep learning, belonging to the technical field of deep learning and pedestrian detection. Based on the traditional SSD pedestrian detection model, the method adopts ResNet, VoVNet and K-means clustering for optimization, addresses the missed detections and false detections that the SSD algorithm produces when pedestrians are dense, occluded or undersized, and improves the accuracy and real-time performance of pedestrian detection as well as small-target pedestrian detection performance.

Description

Pedestrian detection method based on deep learning

Technical Field

The invention relates to the technical field of deep learning and pedestrian detection, in particular to a pedestrian detection method based on deep learning.

Background

Pedestrian detection is an important research branch in the field of computer vision. Its main task is to judge whether a pedestrian appears in an input image or video sequence and to determine the position of the pedestrian. Pedestrian detection technology is widely applied in fields such as video surveillance, driver assistance and intelligent robots.

At present, computer vision technology is developing rapidly, and pedestrian detection, as an important research field, has also improved greatly and is gradually moving toward practical application. With the research and application of deep learning in pedestrian detection, a series of deep learning pedestrian detection algorithms have been derived on the basis of the convolutional neural network. Compared with traditional detection algorithms, deep learning algorithms have stronger robustness and generalization capability and can detect pedestrian targets more quickly and accurately. Benefiting from continuous innovation and optimization of pedestrian detection theory, pedestrian detection provides technical support for intelligent monitoring, unmanned driving and similar applications, and has great application value.

However, in actual monitoring scenes, current pedestrian detection algorithms still suffer from false detections and missed detections of pedestrians, are easily affected by factors such as occlusion, pedestrian posture and scale change, and their detection performance needs to be further enhanced.

Therefore, how to implement a pedestrian detection method based on deep learning is a problem that needs to be solved urgently by those skilled in the art.

Disclosure of Invention

In view of the above, the present invention provides a pedestrian detection method based on deep learning. It optimizes the SSD algorithm to address the missed detections, false detections and long processing time that the algorithm exhibits when pedestrians are dense, occluded or too small, so as to improve the accuracy and speed of pedestrian detection and the small-target pedestrian detection performance.

In order to achieve the purpose, the invention adopts the following technical scheme:

A pedestrian detection method based on deep learning comprises the following steps:

S100: acquiring a sample data set with a pedestrian target, and preprocessing the sample data set;

S200: building an SSD pedestrian detection model, and optimizing the SSD pedestrian detection model to obtain an optimized SSD pedestrian detection model;

S300: sending the sample data set preprocessed in step S100 into the optimized SSD pedestrian detection model for training to generate preselection frames, and processing them to obtain detection frames;

S400: detecting the pedestrian target in the sample data set by using the detection frames, and outputting and displaying the detection result.

Preferably, when step S300 is performed, K-means clustering is performed on the sample data set to obtain the optimal aspect ratio of the preselected frame, including:

S10: setting k clustering centers, with the coordinates of the i-th clustering center set as $(W_i, H_i)$; calculating the distance between each preselection frame and each clustering center, and assigning the preselection frame to the nearest clustering center, wherein the specific expression is as follows:

$$d = 1 - \mathrm{IOU}\left[(x_j, y_j, w_j, h_j),\ (x_j, y_j, W_i, H_i)\right]$$

$$j \in \{1, 2, \dots, N\},\quad i \in \{1, 2, \dots, k\}$$

wherein $d$ is the distance to the cluster center, $(x_j, y_j, w_j, h_j)$ are the coordinates of the corresponding real frame, IOU is the intersection-over-union ratio between two frames, N is the number of preselection frames, and k is the number of clustering center points;

S20: after the preselection frames are assigned, recalculating the center point of each cluster, namely calculating the average of the widths and heights of all the preselection frames in the cluster, wherein the specific expression is as follows:

S30: repeating step S10 and step S20 until the clustering centers no longer change appreciably, and taking the average width and height of the preselection frames at that point to obtain the corresponding preselection frame;

S40: clustering the sample data set by using the preselection frame, and re-determining the width and the height of the preselection frame, wherein the specific expression is as follows:

wherein $m_d$ represents the down-sampling magnification, $w_r$ the width of the preselection frame, $w_k$ the width of the input image, $h_r$ the height of the preselection frame, and $h_k$ the height of the input image;

S50: obtaining the optimal aspect ratio of the preselection frame according to the width and height values of the preselection frame.
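As an illustration only, steps S10 to S30 and S50 can be sketched in plain Python. The sketch assumes that every frame is reduced to a (width, height) pair, that frames are compared as if they shared the same top-left corner (as the distance expression in S10 suggests), and that the initial centers are picked deterministically from the area-sorted frames. The function names `iou_wh` and `kmeans_iou`, the tolerance and the sample boxes are hypothetical and do not come from the original text; the re-scaling of step S40 is not covered.

```python
from typing import List, Tuple

Box = Tuple[float, float]  # (width, height)


def iou_wh(box: Box, center: Box) -> float:
    """IoU of two boxes that share the same top-left corner."""
    inter = min(box[0], center[0]) * min(box[1], center[1])
    union = box[0] * box[1] + center[0] * center[1] - inter
    return inter / union


def kmeans_iou(boxes: List[Box], k: int, max_iter: int = 100,
               tol: float = 1e-6) -> List[Box]:
    # Assumption: initial centers are evenly spaced picks from the area-sorted boxes.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    centers = [ordered[int(i * step + step / 2)] for i in range(k)]
    for _ in range(max_iter):
        # S10: assign every box to the center with the smallest d = 1 - IoU.
        clusters: List[List[Box]] = [[] for _ in range(k)]
        for box in boxes:
            dists = [1.0 - iou_wh(box, c) for c in centers]
            clusters[dists.index(min(dists))].append(box)
        # S20: the new center is the mean width and mean height of its cluster.
        new_centers = []
        for old, members in zip(centers, clusters):
            if not members:  # an empty cluster keeps its previous center
                new_centers.append(old)
                continue
            w = sum(b[0] for b in members) / len(members)
            h = sum(b[1] for b in members) / len(members)
            new_centers.append((w, h))
        # S30: stop once the centers have effectively stopped moving.
        shift = max(abs(n[0] - o[0]) + abs(n[1] - o[1])
                    for n, o in zip(new_centers, centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


# Hypothetical (width, height) pairs of labeled pedestrians, in pixels.
boxes = [(30, 80), (32, 84), (28, 76), (60, 150), (64, 160), (56, 140)]
centers = kmeans_iou(boxes, k=2)
ratios = [w / h for w, h in centers]  # S50: aspect ratio taken as width / height
```

Using 1 - IoU rather than Euclidean distance keeps large frames from dominating the clustering, since the distance depends on overlap rather than absolute size.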

Preferably, step S300 specifically includes:

S310: constructing an SSD network framework based on the ResNet residual network structure, and building an SSD pedestrian detection model according to the SSD network framework;

S320: adding a VoVNet network to the SSD pedestrian detection model to obtain an optimized SSD pedestrian detection model;

S330: setting the corresponding training parameters, creating a training set according to the training parameters to train the optimized SSD pedestrian detection model, and stopping training when the maximum number of iterations is reached, to obtain a trained optimized SSD pedestrian detection model;

S340: sending the sample data set obtained in step S100 into the trained optimized SSD pedestrian detection model, generating preselection frames, and processing them to obtain detection frames.

Preferably, the step S310 specifically includes:

S311: the SSD pedestrian detection model is composed of a plurality of residual block groups, each comprising a plurality of residual blocks; the output of the previous residual block is converted to the same dimension by a 1 x 1 convolution, and is then used as the input of the whole residual structure and fed into the first convolution layer;

S312: the first convolution layer is connected with the SSD pedestrian detection model, and the output of the first convolution layer is used as the input of the next convolution layer;

S313: the output of the next convolution layer, after normalization and a nonlinear function operation, is combined with the output of the previous residual structure to form the SSD pedestrian detection model.

Preferably, adding the VoVNet network structure in step S320 includes:

connecting the first convolution layer and the residual block groups in series in sequence according to the VoVNet network structure, and finally performing a one-time aggregation to obtain the optimized SSD pedestrian detection model.

Preferably, the step S100 further includes acquiring a test data set while acquiring the sample data set with the pedestrian target, testing the trained optimized SSD pedestrian detection model in the test data set, and outputting the tested trained optimized SSD pedestrian detection model.

Preferably, the method further includes the step S500:

S510: judging whether all the preselection frames are trained; if yes, go to step S520;

S520: carrying out non-maximum suppression processing on the detection frames, removing redundant detection frames and determining a unique detection frame;

S530: detecting the pedestrian target in the sample data set according to the unique detection frame;

S540: outputting and displaying the detection result.
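The non-maximum suppression of step S520 can be illustrated with a minimal greedy sketch. It assumes that detection frames are given as (x1, y1, x2, y2) corner coordinates with a confidence score each, and that a frame is redundant when its intersection-over-union with an already kept frame exceeds 0.5. The box format, the threshold and the names `iou` and `nms` are assumptions made for this example, not details stated in the original text.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Return the indices of the boxes that survive greedy suppression."""
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two overlapping detections of one pedestrian and one separate detection.
detections = [(10, 10, 50, 110), (12, 12, 52, 112), (200, 20, 240, 120)]
confidences = [0.9, 0.8, 0.7]
kept = nms(detections, confidences)
```

In this example the second frame overlaps the first almost completely and is dropped, leaving one frame per pedestrian.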

Compared with the prior art, the pedestrian detection method based on deep learning has the following beneficial effects:

(1) the network structure fusing the ResNet and VoVNet network models can effectively fuse multi-layer feature information and map it to deeper, more complex feature representations, which improves target detection performance and gives a better result on small-target detection;

(2) the VoVNet network model aggregates the intermediate features of the last layer of each residual block once to form the final feature map; more shallow features are aggregated at the transition layer, the number of network layers is reduced, and the intermediate layers of the VoVNet model have identical input and output sizes, so the network runs faster and consumes less energy;

(3) the optimal aspect ratio of the preselection frame is obtained automatically through K-means clustering, which removes the SSD algorithm's dependence on manual settings and experience, enhances small-target detection, and reduces missed detections.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flow chart of a method provided by the present invention;

FIG. 2 is a schematic diagram of the optimized SSD network structure provided in this embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention discloses a pedestrian detection method based on deep learning, which comprises the following steps:

In a specific embodiment, the sample data set with the pedestrian target obtained in step S100 may be local upload data or a public sample data set with a target pedestrian label;

Specifically, the public sample data set may be the COCO data set. COCO is a data set for detecting multiple types of targets; it covers 80 categories, not only pedestrian images. The data set comprises about 330 thousand pictures and more than 200 thousand pieces of label information, and supports not only target detection and localization but also target key point analysis and semantic understanding. The open release of the COCO data set has enabled great progress in image segmentation and semantic understanding, and COCO has become almost the 'standard' data set for evaluating image semantic understanding algorithms.

Specifically, if the acquired sample data set is local upload data, pedestrian data in the local sample data set also needs to be labeled, and a pedestrian data labeling file is generated;

More specifically, the process of labeling the local sample data set is as follows:

first, extracting pictures from the locally acquired video, naming them uniformly in jpg format and storing them in a corresponding folder; annotating the uniformly named pictures with a labeling tool, drawing a bounding frame around each person in the picture, tagging every pedestrian in a frame with a label, and saving to generate the corresponding xml file; then moving to the next picture and repeating the labeling step until the pedestrians in all pictures are labeled. During labeling, different pedestrians are simply numbered, and experiments verify whether the pedestrians can be identified.
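For illustration, the xml files produced by the labeling step could be read back as follows. The text does not specify the xml schema, so the sketch assumes a Pascal VOC style layout with `object`, `name` and `bndbox` elements, which many labeling tools emit; the sample file content, the label names and the helper `read_boxes` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for one picture with two numbered pedestrians.
SAMPLE_XML = """
<annotation>
  <filename>000001.jpg</filename>
  <size><width>640</width><height>480</height></size>
  <object>
    <name>pedestrian_1</name>
    <bndbox><xmin>48</xmin><ymin>60</ymin><xmax>112</xmax><ymax>230</ymax></bndbox>
  </object>
  <object>
    <name>pedestrian_2</name>
    <bndbox><xmin>300</xmin><ymin>75</ymin><xmax>356</xmax><ymax>240</ymax></bndbox>
  </object>
</annotation>
"""


def read_boxes(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        coords = tuple(int(bndbox.findtext(tag))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.findtext("name"), coords))
    return boxes


labels = read_boxes(SAMPLE_XML)
```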

In a specific embodiment, the preprocessing of the sample data set in step S100 includes performing grayscale processing, filtering and threshold segmentation on the images in the sample data set.
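A minimal sketch of the three preprocessing operations is given here in plain Python on a tiny image stored as nested lists. The text names neither the filter nor the thresholding rule, so the 3 x 3 mean filter, the fixed threshold of 128 and the standard luma weights used for grayscale conversion are assumptions chosen only to make the example concrete; a real pipeline would more likely rely on an image processing library.

```python
def to_gray(rgb):
    """Grayscale conversion with the common 0.299 / 0.587 / 0.114 luma weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]


def mean_filter3(gray):
    """3 x 3 mean filter; border pixels average over the neighbours that exist."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [gray[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out


def threshold(gray, t=128.0):
    """Fixed-threshold segmentation: 1 for foreground, 0 for background."""
    return [[1 if v >= t else 0 for v in row] for row in gray]


# Hypothetical 3 x 4 image: a bright vertical stripe on a dark background.
image = [[(200, 200, 200) if 1 <= x <= 2 else (20, 20, 20) for x in range(4)]
         for _ in range(3)]
mask = threshold(mean_filter3(to_gray(image)))
```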

In a specific embodiment, the K-means clustering is performed on the sample data set to obtain the optimal aspect ratio of the preselected frame when performing step S300, and the method includes:

$$d = 1 - \mathrm{IOU}\left[(x_j, y_j, w_j, h_j),\ (x_j, y_j, W_i, H_i)\right]$$

$$j \in \{1, 2, \dots, N\},\quad i \in \{1, 2, \dots, k\}$$

wherein $d$ is the distance to the cluster center, $(x_j, y_j, w_j, h_j)$ are the coordinates of the corresponding real frame, IOU is the intersection-over-union ratio between the two frames, N is the number of preselection frames, and k is the number of clustering center points;

Through the above steps, the problem that the proportions of the preselection frames in the SSD algorithm mostly have to be set manually and depend too much on human experience is solved. The optimized SSD network model uses K-means clustering to obtain the optimal cluster number k and the corresponding proportion values automatically; according to the clustering, the proportion value can be set to 0.764, and the aspect ratio of the preselection frame is modified accordingly. The sizes selected through the clustering experiment are closer to the real frame sizes encountered in pedestrian detection, so pedestrian targets can be detected quickly and accurately.
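To show how a clustered proportion value could enter the preselection frame geometry, the following sketch applies the usual SSD default-box convention, in which a frame of relative scale s and aspect ratio a has width s·sqrt(a) and height s/sqrt(a). Reading 0.764 as width divided by height, and the scale of 0.2 on a 300-pixel input, are assumptions made for this example; the function `default_box` is hypothetical.

```python
import math


def default_box(scale, aspect_ratio, image_size):
    """Width and height in pixels of one preselection frame.

    aspect_ratio is taken as width / height (an assumption of this sketch).
    """
    width = scale * math.sqrt(aspect_ratio) * image_size
    height = scale / math.sqrt(aspect_ratio) * image_size
    return width, height


w, h = default_box(scale=0.2, aspect_ratio=0.764, image_size=300)
```

Under these assumptions the frame is roughly 52 by 69 pixels: its area stays fixed by the scale, and only its shape follows the clustered ratio.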

In a specific embodiment, step S300 further includes:

In a specific embodiment, step S310 is specifically as follows:

In an embodiment, the step S320 of joining the VoVNet network specifically includes:

Specifically, the optimized SSD network structure is shown in FIG. 2. The ResNet structure comprises three residual block groups, namely residual block group 1, residual block group 2 and residual block group 3; after VoVNet is added, the structure includes a convolution layer, the residual block groups and an aggregation module.

More specifically, the optimized SSD pedestrian detection model has the following structure: a first convolution layer, residual block group 0_residual block 0 to residual block group 0_residual block 6, a first aggregation module, residual block group 1_residual block 0 to residual block group 1_residual block 6, a second aggregation module, residual block group 2_residual block 0 to residual block group 2_residual block 3, residual block group 2_residual block 4 to residual block group 2_residual block 6, convolution layer 3_2, convolution layer 4_2, convolution layer 5_2 and convolution layer 6_2, which are connected in series in sequence and finally aggregated once to obtain the optimized SSD pedestrian detection model.

More specifically, the working principle of the optimized SSD pedestrian detection model is as follows: an aggregation module is arranged after each residual block group, and ResNet is added into the aggregation module after each residual block group; features are fused between every two residual blocks within a residual block group through the nonlinear transformation combination Conv + BN + ReLU; features are extracted from the selected residual block group 2_residual block 3, residual block group 2_residual block 6 and four convolution layers comprising 1 x 1 and 3 x 3 convolutions. That is, the optimized SSD pedestrian detection model extracts feature information from six feature maps: residual block group 2_residual block 0 to residual block group 2_residual block 3, residual block group 2_residual block 4 to residual block group 2_residual block 6, convolution layer 3_2, convolution layer 4_2, convolution layer 5_2 and convolution layer 6_2. The backbone network uses the ResNet structure; following the VoVNet network model, each obtained feature map is connected to the next layer and aggregated once in the final feature map to form the final feature output, giving the optimized SSD pedestrian detection model.

Through the above steps, the output of each layer of the optimized SSD pedestrian detection network structure is not directly connected to all subsequent intermediate layers, so the input size of the intermediate layers remains unchanged; shallow features are aggregated more at the transition layer, and deep features have little influence on it, so the network parameters and the number of intermediate layers are reduced without affecting feature transmission. The optimized SSD pedestrian detection structure combines the advantages of ResNet and VoVNet: while learning residuals, it combines features from different layers to describe a target jointly and continuously merges shallow feature information into the deep network structure, so that the final feature output fully combines shallow and deep feature information and features can be learned better. On the basis of ResNet-SSD, the model further alleviates the problems of vanishing gradients and insufficient small-target detection precision, so the network is easier to train; the number of network parameters is greatly reduced, resource waste is reduced, and target detection performance is improved.
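The remark that the input size of the intermediate layers stays unchanged can be made concrete with a small channel-count comparison. The sketch contrasts one-time aggregation with dense aggregation, in which every layer receives all earlier outputs; the dense scheme is brought in here only as a point of comparison, and the channel numbers and function names are hypothetical.

```python
def dense_input_channels(c_in, growth, n_layers):
    """Dense aggregation: layer i sees the block input plus all earlier outputs."""
    return [c_in + i * growth for i in range(n_layers)]


def one_time_input_channels(c_in, c_mid, n_layers):
    """One-time aggregation: each layer sees only the previous layer's output,
    so every layer after the first has the same input width."""
    return [c_in] + [c_mid] * (n_layers - 1)


def one_time_concat_channels(c_in, c_mid, n_layers):
    """The block input and all intermediate outputs are concatenated once, at the end."""
    return c_in + n_layers * c_mid


# Hypothetical block: 64 input channels, 5 layers producing 32 channels each.
dense = dense_input_channels(64, 32, 5)
one_time = one_time_input_channels(64, 32, 5)
aggregated = one_time_concat_channels(64, 32, 5)
```

With these numbers the dense inputs grow from 64 to 192 channels, while the one-time scheme keeps every intermediate input at 32 channels and pays for a 224-channel concatenation only once.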

In a specific embodiment, the step S100 of obtaining the sample data set with the pedestrian target further includes obtaining a test data set, testing the trained optimized SSD pedestrian detection model in the test data set, and outputting the tested trained optimized SSD pedestrian detection model.

Specifically, an SSD detection model is built on the PyTorch deep learning framework and modified into a binary classification model suitable for pedestrian detection; the SSD pedestrian detection model is built according to the PyTorch framework and the SSD algorithm framework, and the trained optimized SSD pedestrian detection model is tested on the test data set.

In a specific embodiment, obtaining the detection frame through processing in step S340 means obtaining the corresponding detection frame by matching against the preselection frames, specifically:

S341: for each preselection frame, searching for the detection frame with the largest intersection-over-union ratio and matching the two, ensuring that each preselection frame has one corresponding detection frame;

S342: for the detection frames remaining after the matching in S341, attempting to match each with any labeling frame, and matching the two if their intersection-over-union ratio is greater than a preset value.

Specifically, the intersection-over-union ratio is calculated as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

wherein A is a preselection frame, B is a detection frame, and J(A, B) is the ratio of the intersection to the union of the preselection frame and the detection frame.
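The intersection-over-union ratio and the two-stage matching of S341 and S342 can be sketched as follows. The sketch is one possible reading of those steps: it assumes corner-format frames (x1, y1, x2, y2), a preset value of 0.5, and that the second stage compares the leftover detection frames against the same set of frames used in the first stage. The names `jaccard` and `match`, the threshold and the sample frames are hypothetical.

```python
def jaccard(a, b):
    """J(A, B): intersection area divided by union area of two corner-format frames."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(preselection, detection, threshold=0.5):
    """Return {detection index: preselection index} for all matched pairs."""
    pairs = {}
    # S341: each preselection frame takes the detection frame with the largest IoU.
    for p, pre in enumerate(preselection):
        best = max(range(len(detection)), key=lambda d: jaccard(pre, detection[d]))
        pairs[best] = p
    # S342: a detection frame left over is matched to the first frame whose
    # IoU with it exceeds the preset value, if there is one.
    for d, det in enumerate(detection):
        if d in pairs:
            continue
        for p, pre in enumerate(preselection):
            if jaccard(det, pre) > threshold:
                pairs[d] = p
                break
    return pairs


pre_frames = [(10, 10, 50, 110)]
det_frames = [(12, 12, 52, 112), (14, 14, 54, 114), (200, 20, 240, 120)]
pairs = match(pre_frames, det_frames)
```

Here the first detection frame is matched in the first stage, the second is picked up by the threshold rule, and the third, which does not overlap at all, stays unmatched.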

In a specific embodiment, the method further includes step S500:

S520: carrying out non-maximum suppression processing on the detection frames, removing redundant detection frames and determining a unique detection frame;

S540: outputting and displaying the detection result.

According to the above technical scheme, compared with the prior art, the invention provides a pedestrian detection method based on deep learning in which the VoVNet network replaces VGG as the neural network model of the SSD network, and the original SSD convolution layers are linked by shortcut connections to form a shortcut mechanism. Through these shortcut mechanisms, the feature information of the shallow network can be connected to the deep network structure, so that the deep neural network fuses and fully uses the shallow feature information, expresses the feature information of the target better, and improves the precision of small-target detection. Moreover, a batch normalization (BN) operation and a nonlinear ReLU operation are added between the convolution layers of each residual block of the ResNet network and between adjacent residual structures, which maintains the stability of the deep neural network structure. However, the ResNet network is deep, which means that the residual structure must be repeated many times; its parameter utilization is therefore low, and it also suffers from low operation speed, large memory footprint, low computational efficiency and insufficient small-target detection precision.

The method has the following specific beneficial effects:

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A pedestrian detection method based on deep learning is characterized by comprising the following steps:

2. The deep learning-based pedestrian detection method according to claim 1, wherein the K-means clustering is performed on the sample data set to obtain the optimal aspect ratio of the pre-selection frame in step S300, and the method comprises:

$$d = 1 - \mathrm{IOU}\left[(x_j, y_j, w_j, h_j),\ (x_j, y_j, W_i, H_i)\right]$$

$$j \in \{1, 2, \dots, N\},\quad i \in \{1, 2, \dots, k\}$$

3. The pedestrian detection method based on deep learning of claim 2, wherein step S300 specifically includes:

4. The pedestrian detection method based on deep learning of claim 3, wherein the step S310 is as follows:

5. The deep learning-based pedestrian detection method according to claim 3, wherein the step S320 of joining a VoVNet network structure comprises:

6. The pedestrian detection method based on deep learning of claim 3, wherein the step S100 further includes obtaining a test data set while obtaining the sample data set with the pedestrian target, testing the trained optimized SSD pedestrian detection model on the test data set, and outputting the tested trained optimized SSD pedestrian detection model.

7. The pedestrian detection method based on deep learning of claim 1, further comprising the step S500:

S540: outputting and displaying the detection result.