CN117523612A - Dense pedestrian detection method based on Yolov5 network - Google Patents
- Publication number
- CN117523612A (application number CN202311540235.5A)
- Authority
- CN
- China
- Prior art keywords
- detection
- network
- pedestrian
- dense
- dense pedestrian
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/54—Extraction of image or video features relating to texture
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention belongs to the technical field of deep-learning image processing, and in particular discloses a dense pedestrian detection method based on a Yolov5 network, comprising the following steps: S1, acquiring a data set and constructing a Yolov5 network model; S2, inputting a dense pedestrian data set picture, extracting picture features of the dense pedestrian images with a trunk feature extraction network, and enhancing pedestrian features with a feature enhancement network; S3, the detection head detects the features output by the feature enhancement module, screens them with a non-maximum suppression algorithm, and outputs the detection result. By adopting a Yolov5-based network, the invention can perform pedestrian detection on dense pedestrian images and alleviate the false-detection and missed-detection problems of existing image detection technology; adding deformable convolution to the model helps capture detail and texture information during feature extraction, and using the DIOU loss function makes the network pay more attention to separating dense pedestrian regions during non-maximum suppression, thereby improving the accuracy of the detected images.
Description
Technical Field
The invention relates to the technical field of deep learning image processing, in particular to a dense pedestrian detection method based on a Yolov5 network.
Background
Pedestrian detection is a special case of object detection and plays a vital role in applications such as autonomous driving, intelligent surveillance systems, robotics, and advanced human-computer interaction. It is also the basis of numerous research topics, such as target tracking, human pose estimation, and pedestrian search. According to incomplete statistics, the number of traffic-accident fatalities in China increases year by year; drivers bear primary responsibility in these accidents, and pedestrians are the most severely affected victim group. An intelligent driving-assistance system that successfully detects obstacles and pedestrians in front of the vehicle can prompt the driver to avoid them while driving, reducing the probability of collision with pedestrians. Video image acquisition devices deployed in streets and alleys can likewise protect people's legal rights and interests around the clock. Pedestrian detection means identifying the pedestrians in an image and framing each pedestrian's position with a rectangular bounding box.
At present, pedestrian detection faces many challenges, and traditional pedestrian detection algorithms struggle to meet the requirements of image-based detection. In recent years, computer hardware has developed rapidly, and the continuous improvement of graphics processing units (GPUs) provides reliable hardware support for processing image data. Meanwhile, researchers have proposed a series of excellent deep-learning-based target detection algorithms that use convolutional neural networks for image detection, making detection fast and accurate. However, as application scenes grow more complex, the poses and occlusion situations of pedestrians in images also become more diverse, making the pedestrian detection task more challenging.
In recent years, various technical methods for dense pedestrian detection are proposed at home and abroad, but the dense pedestrian detection still has the following problems:
1. Large differences in pedestrian appearance, including viewing angle, pose, clothing and accessories, illumination, and imaging distance. The appearance of a pedestrian varies with the viewing angle, and pedestrians in different poses look different. Clothing and carried items, such as hats, scarves, and luggage, also cause large appearance differences, and differences in illumination add further complexity. The imaging distance directly affects the size of a pedestrian sample in the image: a distant human body appears smaller, a nearby one larger, and the resulting appearance difference is obvious;
2. Sample occlusion. In many practical application scenarios, pedestrians are densely distributed; in places such as airports, stations, and pedestrian zones, occlusion is severe and comprises both inter-class and intra-class occlusion. Inter-class occlusion is occlusion between different kinds of objects, i.e., between a pedestrian and other objects; intra-class occlusion is occlusion between objects of the same kind, i.e., between one pedestrian and another. The occlusion problem poses a greater challenge to pedestrian detection;
3. Complex backgrounds. The environmental background in most real-world scenes is neither single nor unchanging: for example, some objects in urban street views resemble pedestrians in appearance, shape, color, and texture, and daytime and nighttime sample images have completely different background environments. A complex background inevitably affects the pedestrian detection result.
Disclosure of Invention
(one) solving the technical problems
Aiming at the defects of the prior art, the invention provides a dense pedestrian detection method based on a Yolov5 network, which addresses the poor results of existing models on dense pedestrian images and the false detections of the prior art.
(II) technical scheme
The invention adopts the following technical scheme for realizing the purposes:
a dense pedestrian detection method based on a Yolov5 network comprises the following steps:
s1, acquiring a data set and constructing a Yolov5 network model;
s2, inputting a dense pedestrian data set picture, extracting picture features of dense pedestrian images by a trunk feature extraction network, and enhancing pedestrian features by a feature enhancement network;
s3, the detection head detects the features output by the feature enhancement module, screens them with a non-maximum suppression algorithm, and outputs the detection result;
and S4, objectively judging the dense pedestrian detection pictures.
Further, the main feature extraction network in S2 extracts the picture features of the dense pedestrian image, and the C3 module in the main feature extraction network uses a deformable convolution module to replace a common convolution module:
the deformable convolution module and the normal convolution module may be expressed as:
because the position of the deformable convolution module after adding the offset is non-integer and does not correspond to the pixel point actually existing on the feature map, interpolation is needed to obtain the offset pixel value, and bilinear interpolation can be generally adopted, and the general convolution and the x of the deformable convolution are calculated through bilinear interpolation, and the formula is expressed as follows:
further, the loss function in S3 uses DIOU instead of GIOU;

incorporating the center-point distance into the penalty term, the DIoU loss can be expressed as:

$L_{DIoU}=1-IoU(P,G)+\frac{\rho^2(P_0,G_0)}{c^2}$

P and G are the predicted box and the ground-truth box, with center points $P_0$ and $G_0$ respectively; c is the diagonal length of the smallest box enclosing both P and G, and $\rho(\cdot,\cdot)$ is the Euclidean distance between the two center points.
Further, the criterion of the step S4 is to measure the difference of image detection of the dense population in terms of accuracy, recall and average accuracy, and quantitatively evaluate the image detection effect;
the quantitative evaluation is to adopt an accuracy rate P, a recall rate R and an average accuracy rate AP respectively,
performing quantitative analysis, wherein the precision P is defined as the accuracy of all detected targets, and can be expressed as:
the recall R-defined as the detection accuracy in all positive samples, can be expressed as:
the average precision AP, defined as the average of the precision at different recall rates, may be expressed as:
TP is positive samples and is predicted to be the number of positive classes; FP is negative sample forecast to be positive number; FN is the number of negative classes predicted by positive samples; TN is the negative sample prediction negative class number.
(III) beneficial effects
Compared with the prior art, the invention provides a dense pedestrian detection method based on a Yolov5 network, which has the following beneficial effects:
the invention adopts the Yolov5 network, can detect pedestrians on dense pedestrian images, can solve the problems of false detection and missing detection of the existing pedestrian detection technology, adds deformable convolution in the model, is helpful for extracting details and texture information of pedestrian characteristics during characteristic extraction, ensures that the network pays more attention to the part with more characteristic information during characteristic learning of pedestrians, and improves the precision of the detected images; the invention can play a great role in a plurality of fields requiring clear images, such as target tracking, image classification, target detection and the like.
Drawings
FIG. 1 is a schematic diagram of a Yolov 5-based network structure according to the present invention;
FIG. 2 is a schematic diagram of a deformable convolution structure employed in the present invention;
FIG. 3 is a schematic diagram of a DIOU module according to the present invention;
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
As shown in fig. 1-3, a dense pedestrian detection method based on a Yolov5 network according to an embodiment of the present invention includes the following steps:
s1: acquiring a data set and constructing a Yolov5 model;
the dataset included a training set and a test set, the dataset was a crowing human dataset, there were 15000 images in the training set, 5000 images in the test set, and the validation set contained 4370 images. In the training and validation set, there are 470K instances, approximately 23 individuals per picture, and various occlusions. Each instance of the person with the bounding box, the region bounding box and the person's bounding box is visible in the person's head.
The Yolov5 algorithm mainly comprises an Input module (Input), a trunk feature extraction module (Backbone), a feature enhancement module (Neck) and a detection Head (Head).
The Yolov5 algorithm divides the input image into N×N cells, each of which is used to detect objects whose center coordinates lie within that cell. Each cell predicts P bounding boxes, and each bounding box contains five pieces of information. Therefore, the final predicted value for a single input picture is a tensor of N×N×(P×5+C) (C is the number of classes), and the model predicts N×N×P bounding boxes altogether. When the Yolov5 algorithm performs target detection, it sets a confidence threshold (generally 0.5): boxes whose predicted confidence is below the threshold are screened out first, and the model retains the prediction boxes with relatively high confidence after this preliminary screening; then the NMS (Non-Maximum Suppression) algorithm filters out duplicate prediction boxes for the same target, retaining the optimal prediction box. Owing to the varied occlusion and complicated backgrounds of dense scenes, screening directly by confidence causes missed and false detections. In view of these defects of the prior art, the invention provides a dense pedestrian detection method based on Yolov5 that achieves higher detection precision and greatly reduces false detection.
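The confidence-threshold screening followed by non-maximum suppression described above can be sketched as follows. This is a minimal illustrative sketch, not the Yolov5 implementation (which operates on batched tensors); boxes are assumed to be (x1, y1, x2, y2) tuples.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, conf_thr=0.5, iou_thr=0.5):
    """Discard boxes below the confidence threshold, then greedily keep
    the highest-scoring box and suppress any remaining box that overlaps
    a kept box by more than iou_thr. Returns the kept indices."""
    cand = sorted((i for i, s in enumerate(scores) if s >= conf_thr),
                  key=lambda i: scores[i], reverse=True)
    keep = []
    for i in cand:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep
```

For example, two heavily overlapping predictions of the same pedestrian collapse to the higher-scoring one, while a distant pedestrian's box survives.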
S2: inputting a dense pedestrian data set picture, extracting picture features of dense pedestrian images by a trunk feature extraction network, and enhancing pedestrian features by a feature enhancement network;
the main function of the input end is to preprocess the input picture; the main preprocessing methods include adaptive picture scaling, Mosaic data enhancement, and adaptive anchor-box calculation. First, an input picture of arbitrary size undergoes adaptive picture scaling: the picture is scaled to a fixed size according to its aspect ratio. Compared with ordinary picture scaling, adaptive scaling adds the fewest possible black borders to the original picture, which avoids the information redundancy caused by excessive black borders and improves detection speed. The fixed-size pictures then undergo Mosaic data enhancement: four pictures are randomly selected and spliced by random scaling, random cropping, and random arrangement, which enriches the data set while reducing the GPU computation load. Finally, adaptive anchor-box calculation sets a suitable initial anchor-box size in advance for different data sets, which facilitates more accurate localization of targets in subsequent detection.
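The adaptive scaling step can be sketched as the following computation. This is a simplified sketch: it returns only the scale factor and side padding, whereas YOLOv5's actual letterboxing also rounds the padding to a multiple of the network stride, which is omitted here.

```python
def letterbox_params(w, h, size=640):
    """Aspect-ratio-preserving resize to a size x size canvas with minimal
    black borders: one scale factor for both axes, leftover space split
    evenly between the two sides as padding."""
    r = min(size / w, size / h)            # single factor keeps the aspect ratio
    new_w, new_h = round(w * r), round(h * r)
    pad_w, pad_h = size - new_w, size - new_h
    return r, (pad_w // 2, pad_h // 2)     # per-side padding in pixels
```

A 1280×720 frame, for instance, is scaled by 0.5 to 640×360 and padded with 140 black pixels above and below rather than being stretched.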
The main function of the trunk feature extraction network is to extract the position information and semantic information of the target to be detected; its main structures are Conv, C3, and SPPF. The Conv structure consists of a standard convolution, BN, and the SiLU activation function. The C3 structure is composed of standard convolutions and bottleneck modules; compared with layer-by-layer convolution, it effectively simplifies the network, reducing computation while fully extracting features and shortening the algorithm's inference time. The SPPF applies three successive max-pooling operations with 5×5 kernels and can fuse several features of different resolutions at the same time, thereby obtaining more effective information about dense pedestrians.
The main function of the feature aggregation network is to fuse the position information and semantic information of the feature maps; it adopts the feature pyramid network (FPN) and path aggregation network (PAN) structure. The FPN uses a top-down structure: the rich semantic information contained in the deep feature maps is passed downward through up-sampling and fused with the shallow feature maps. The PAN uses a bottom-up structure: the rich position information contained in the shallow feature maps is passed upward through down-sampling and fused with the deep feature maps. The shallow feature maps thus obtain semantic information from the deep feature maps, and the deep feature maps obtain position information from the shallow ones; fusing feature-map information at multiple depths realizes multi-scale detection and enhances the generalization capability of the network.
The improvement scheme adds a deformable convolution module to the C3 modules of the trunk feature extraction network: the deformable convolution module replaces the standard convolution modules in the last three C3 modules, which improves the trunk network's ability to extract dense pedestrian features and suppresses the interference of complex backgrounds with feature extraction.
The deformable convolution module (DCN) changes the positions of the sampling points during the training phase by learning offsets. As shown in fig. 2, a conventional convolution operation divides the feature map into regions of the same size as the convolution kernel and then convolves them, so the sampled positions on the feature map are fixed. This limits feature extraction and prevents the network from capturing more varied features. To solve this problem, a deformable convolution, which has geometric modeling capability and a larger receptive field, is employed so that as many sampling points as possible fall on the target.
Further, $p_n$ enumerates each position within the convolution kernel range $\mathcal{R}$ relative to $p_0$. The deformable convolution introduces, on top of the conventional convolution, an offset $\Delta p_n$ (typically fractional) for each sampling point; the offsets are generated by applying another convolution to the input feature map. The outputs of the ordinary convolution and the deformable convolution are, respectively:

$y(p_0)=\sum_{p_n\in\mathcal{R}} w(p_n)\,x(p_0+p_n)$

$y(p_0)=\sum_{p_n\in\mathcal{R}} w(p_n)\,x(p_0+p_n+\Delta p_n)$
further, since the position after adding the offset is not an integer and does not correspond to an actual pixel on the feature map, interpolation is required to obtain the offset pixel value; bilinear interpolation is generally used, and the value $x(p)$ for both the ordinary and the deformable convolution is calculated as:

$x(p)=\sum_{q} G(q,p)\,x(q),\quad G(q,p)=\max(0,1-|q_x-p_x|)\cdot\max(0,1-|q_y-p_y|)$

where q ranges over the integer pixel positions neighboring p.
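The bilinear interpolation above can be sketched directly; the sketch below operates on a feature map represented as a list of rows (the real operator works on tensors and is differentiable through the offsets).

```python
import math

def bilinear_sample(fmap, py, px):
    """Bilinearly interpolate the 2-D feature map fmap at the fractional
    position (py, px), as needed when a deformable-convolution offset moves
    a sampling point off the integer pixel grid. Only the (up to) four
    integer neighbors contribute a nonzero weight."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(py), math.floor(px)
    val = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                # G(q, p) = g(q_y, p_y) * g(q_x, p_x) with g(a, b) = max(0, 1 - |a - b|)
                wgt = max(0.0, 1 - abs(py - qy)) * max(0.0, 1 - abs(px - qx))
                val += wgt * fmap[qy][qx]
    return val
```

Sampling a 2×2 map at its center, for example, returns the mean of the four pixels, and sampling exactly on an integer position returns that pixel unchanged.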
s3: the detection head detects the features output by the feature enhancement module, screens them with a non-maximum suppression algorithm, and outputs the detection result;
the main function of the detection layer is to detect on feature maps of different scales through preset anchor boxes, with non-maximum suppression applied to the predictions to finally obtain the target classification and position information. The improvement here replaces the GIOU loss function with the DIOU loss function, which separates dense targets more effectively. The main part of the detection layer is the three-scale detector in the Head component: assuming the network input picture size is 640×640, the three feature-map sizes are, from top to bottom, 80×80, 40×40, and 20×20.
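The three scales follow from the network's downsampling factors; assuming the usual YOLOv5 strides of 8, 16, and 32, the grid sizes above can be reproduced as:

```python
def head_grid_sizes(input_size=640, strides=(8, 16, 32)):
    """Feature-map side lengths of the three detection scales: the input
    is downsampled by each stride, giving the 80x80, 40x40 and 20x20
    grids mentioned above for a 640x640 input."""
    return [input_size // s for s in strides]
```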
The IOU loss function formula is:
IoU(P,G)=(P∩G)/(P∪G)
further, the IOU, expressed in the language of sets, is the intersection of the two regions divided by their union; however, IoU(P,G)=0 whenever the prediction box does not intersect the ground-truth box. To overcome this drawback, YOLOv5 uses the GIOU with penalty term $R=\frac{|C-(P\cup G)|}{|C|}$, where C is the smallest box enclosing both P and G:

$L_{GIoU}=1-IoU(P,G)+\frac{|C-(P\cup G)|}{|C|}$

Although the GIoU solves the problem of two non-intersecting boxes, when one box encloses the other the GIOU degenerates to the IOU.
Further considering the center-point distance in the penalty term R, the DIoU loss is:

$L_{DIoU}=1-IoU(P,G)+\frac{\rho^2(P_0,G_0)}{c^2}$
further, P and G are the predicted box and the ground-truth box, with center points $P_0$ and $G_0$ respectively; c is the diagonal length of the smallest box enclosing both P and G, and $\rho(\cdot,\cdot)$ is the Euclidean distance between the two center points.
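The DIoU loss defined above can be sketched for axis-aligned boxes as follows; this is an illustrative sketch on (x1, y1, x2, y2) tuples, not the tensorized training loss.

```python
def diou_loss(p, g):
    """DIoU loss: 1 - IoU + rho^2 / c^2, where rho is the distance between
    the two box centers and c the diagonal of the smallest enclosing box."""
    # intersection and union
    ix1, iy1 = max(p[0], g[0]), max(p[1], g[1])
    ix2, iy2 = min(p[2], g[2]), min(p[3], g[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((p[2] - p[0]) * (p[3] - p[1])
             + (g[2] - g[0]) * (g[3] - g[1]) - inter)
    iou = inter / union
    # squared distance between the two center points (rho^2)
    rho2 = (((p[0] + p[2]) / 2 - (g[0] + g[2]) / 2) ** 2
            + ((p[1] + p[3]) / 2 - (g[1] + g[3]) / 2) ** 2)
    # squared diagonal of the smallest box enclosing both P and G (c^2)
    cx1, cy1 = min(p[0], g[0]), min(p[1], g[1])
    cx2, cy2 = max(p[2], g[2]), max(p[3], g[3])
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    return 1 - iou + rho2 / c2
```

Unlike GIoU, the distance term stays informative even when one box encloses the other, which is what helps separate the overlapping boxes of dense pedestrians.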
S4: objectively distinguishing the dense pedestrian detection pictures;
further, the criterion of step S4 is to measure dense-crowd image detection in terms of precision, recall, and average precision, and to quantitatively evaluate the image detection effect;

the quantitative evaluation adopts the precision P (Precision), the recall R (Recall), and the average precision AP (Average Precision) for quantitative analysis, where the precision P, defined as the accuracy over all detected targets, can be expressed as:

$P=\frac{TP}{TP+FP}$

the recall R, defined as the detection accuracy over all positive samples, can be expressed as:

$R=\frac{TP}{TP+FN}$

and the average precision AP, defined as the average of the precision at different recall rates, can be expressed as:

$AP=\int_0^1 P(R)\,dR$

TP is the number of positive samples predicted as positive; FP is the number of negative samples predicted as positive; FN is the number of positive samples predicted as negative; TN is the number of negative samples predicted as negative.
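These evaluation formulas can be sketched directly from the counts; the AP here is approximated by rectangular summation over recall increments, a simplification of the integral definition above.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN) from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Approximate AP as the area under the precision-recall curve,
    summing precision over increasing recall increments."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

For example, 8 true positives with 2 false positives and 4 missed pedestrians gives P = 0.8 and R = 2/3.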
By adopting the Yolov5 network, the invention can detect pedestrians in dense pedestrian images and alleviate the false-detection and missed-detection problems of existing pedestrian detection technology. Adding deformable convolution to the model helps extract the detail and texture information of pedestrian features during feature extraction, so that the network pays more attention to the parts carrying more feature information when learning pedestrian features, improving the precision of the detected images. The invention can play a significant role in many fields that require clear images, such as target tracking, image classification, and target detection.
Finally, it should be noted that the foregoing description covers only preferred embodiments of the present invention and does not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described therein may still be modified and some of their technical features replaced by equivalents. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included within its scope of protection.
Claims (4)
1. The dense pedestrian detection method based on the Yolov5 network is characterized by comprising the following steps of:
s1, acquiring a data set and constructing a Yolov5 network model;
s2, inputting a dense pedestrian data set picture, extracting picture features of dense pedestrian images by a trunk feature extraction network, and enhancing pedestrian features by a feature enhancement network;
s3, the detection head detects the characteristics output by the characteristic enhancement module, screens a non-maximum suppression algorithm and outputs a detection result;
and S4, objectively judging the dense pedestrian detection pictures.
2. The method for dense pedestrian detection based on the Yolov5 network according to claim 1, wherein the method comprises the following steps: and in the step S2, the main feature extraction network extracts the picture features of the dense pedestrian images, and a deformable convolution module is used for replacing a common convolution module by a C3 module in the main feature extraction network:
the deformable convolution module and the normal convolution module may be expressed as:
because the position of the deformable convolution module after adding the offset is non-integer and does not correspond to the pixel point actually existing on the feature map, interpolation is needed to obtain the offset pixel value, and bilinear interpolation can be generally adopted, and the general convolution and the x of the deformable convolution are calculated through bilinear interpolation, and the formula is expressed as follows:
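The bilinear sampling that claim 2 relies on can be sketched as follows for a single-channel feature map; this is an illustrative helper (the name `bilinear_sample` and the edge clamping are assumptions, not the patent's code), showing how a fractional location produced by an offset is read from the four surrounding integer pixels.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    # Read feature map `feat` (H x W) at a fractional location (y, x):
    # weight the four neighbouring integer pixels by bilinear kernels,
    # as needed when a deformable-convolution offset lands between pixels.
    h, w = feat.shape
    # clamp the query point onto the feature map
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    ly, lx = y - y0, x - x0  # fractional parts = interpolation weights
    return ((1 - ly) * (1 - lx) * feat[y0, x0]
            + (1 - ly) * lx * feat[y0, x1]
            + ly * (1 - lx) * feat[y1, x0]
            + ly * lx * feat[y1, x1])
```

A deformable convolution then sums kernel weights times such sampled values at `p0 + pn + Δpn`; frameworks provide this directly (e.g. `torchvision.ops.deform_conv2d` in PyTorch).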
3. The dense pedestrian detection method based on the Yolov5 network according to claim 1, wherein the loss function in S3 uses DIoU instead of GIoU:
on the basis of the GIoU, the distance between the center points is introduced as a penalty term R, and the DIoU loss can be expressed as:
L_DIoU = 1 − IoU + ρ²(P₀, G₀) / c²
where P and G are the predicted box and the ground-truth box respectively, whose center points are P₀ and G₀; c represents the diagonal length of the smallest box enclosing the two boxes; and ρ represents the Euclidean distance between the two center points.
4. The dense pedestrian detection method based on the Yolov5 network according to claim 1, wherein the judgment standard of step S4 measures the dense-crowd image detection results in terms of precision, recall and average precision, and quantitatively evaluates the detection effect;
the quantitative evaluation is to adopt an accuracy rate P, a recall rate R and an average accuracy rate AP respectively,
performing quantitative analysis, wherein the precision P is defined as the accuracy of all detected targets, and can be expressed as:
the recall R-defined as the detection accuracy in all positive samples, can be expressed as:
the average precision AP, defined as the average of the precision at different recall rates, may be expressed as:
TP is positive samples and is predicted to be the number of positive classes; FP is negative sample forecast to be positive number; FN is the number of negative classes predicted by positive samples; TN is the negative sample prediction negative class number.
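The three metrics of claim 4 reduce to a few lines of arithmetic; the sketch below is an illustrative minimal version (function names are assumptions, and AP is approximated as the area under a given precision-recall curve rather than a dataset-specific protocol).

```python
def precision(tp, fp):
    # P = TP / (TP + FP): fraction of detections that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # R = TP / (TP + FN): fraction of ground-truth pedestrians found.
    return tp / (tp + fn)

def average_precision(recalls, precisions):
    # AP approximated as the area under the precision-recall curve:
    # sum precision times the recall increment over sorted recall points.
    ap, prev_r = 0.0, 0.0
    for r, p in sorted(zip(recalls, precisions)):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

For example, a detector with 8 true positives, 2 false positives and 2 missed pedestrians scores P = R = 0.8; production evaluations typically use an interpolated AP (e.g. the COCO 101-point rule) rather than this raw sum.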
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311540235.5A CN117523612A (en) | 2023-11-20 | 2023-11-20 | Dense pedestrian detection method based on Yolov5 network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117523612A true CN117523612A (en) | 2024-02-06 |
Family
ID=89743373
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311540235.5A Pending CN117523612A (en) | 2023-11-20 | 2023-11-20 | Dense pedestrian detection method based on Yolov5 network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117523612A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117893413A (en) * | 2024-03-15 | 2024-04-16 | 博创联动科技股份有限公司 | Vehicle-mounted terminal man-machine interaction method based on image enhancement |
CN117893413B (en) * | 2024-03-15 | 2024-06-11 | 博创联动科技股份有限公司 | Vehicle-mounted terminal man-machine interaction method based on image enhancement |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103824070B (en) | A kind of rapid pedestrian detection method based on computer vision | |
CN103020992B (en) | A kind of video image conspicuousness detection method based on motion color-associations | |
CN101980242B (en) | Human face discrimination method and system and public safety system | |
CN105260749B (en) | Real-time target detection method based on direction gradient binary pattern and soft cascade SVM | |
CN110929593A (en) | Real-time significance pedestrian detection method based on detail distinguishing and distinguishing | |
CN101470809A (en) | Moving object detection method based on expansion mixed gauss model | |
CN110119726A (en) | A kind of vehicle brand multi-angle recognition methods based on YOLOv3 model | |
CN102902983B (en) | A kind of taxi identification method based on support vector machine | |
CN106778540B (en) | Parking detection is accurately based on the parking event detecting method of background double layer | |
Cho et al. | Semantic segmentation with low light images by modified CycleGAN-based image enhancement | |
CN117523612A (en) | Dense pedestrian detection method based on Yolov5 network | |
Prasad et al. | HOG, LBP and SVM based traffic density estimation at intersection | |
CN103530640A (en) | Unlicensed vehicle detection method based on AdaBoost and SVM (support vector machine) | |
Dow et al. | A crosswalk pedestrian recognition system by using deep learning and zebra‐crossing recognition techniques | |
CN114049572A (en) | Detection method for identifying small target | |
Bush et al. | Static and dynamic pedestrian detection algorithm for visual based driver assistive system | |
CN105469054A (en) | Model construction method of normal behaviors and detection method of abnormal behaviors | |
CN116935361A (en) | Deep learning-based driver distraction behavior detection method | |
Ma et al. | AVS-YOLO: Object detection in aerial visual scene | |
Wang et al. | Multiscale traffic sign detection method in complex environment based on YOLOv4 | |
Manoharan et al. | Image processing-based framework for continuous lane recognition in mountainous roads for driver assistance system | |
CN111914606A (en) | Smoke detection method based on deep learning of time-space characteristics of transmissivity | |
Yao et al. | A real-time pedestrian counting system based on rgb-d | |
CN102682291B (en) | A kind of scene demographic method, device and system | |
CN117475353A (en) | Video-based abnormal smoke identification method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||