AU2021104243A4 - Method of Pedestrian detection based on multi-layer feature fusion - Google Patents

Method of Pedestrian detection based on multi-layer feature fusion

Info

Publication number
AU2021104243A4
AU2021104243A4
Authority
AU
Australia
Prior art keywords
detection
pedestrian
network
pedestrian detection
yolov3
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2021104243A
Inventor
Ziteng Li
Tianni Mei
Wen SUN
Congyao Wang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to AU2021104243A priority Critical patent/AU2021104243A4/en
Application granted granted Critical
Publication of AU2021104243A4 publication Critical patent/AU2021104243A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

This paper proposes an end-to-end target detection network with multi-scale feature fusion to address the small size and frequent occlusion of pedestrian targets in pedestrian detection. The algorithm is based on the YOLOv3 network; it fully integrates multi-scale features, enhances the expression of small-target features, improves the robustness of pedestrian detection in complex environments, and raises pedestrian detection accuracy while preserving real-time detection. In the experiments, the algorithm is compared with current mainstream pedestrian detection algorithms. It effectively improves detection accuracy on the INRIA and KITTI data sets, where the average precision of YOLOv3 on the two data sets is improved by 6% and 24.7%, respectively.

Description

TITLE Method of Pedestrian detection based on multi-layer feature fusion
FIELD OF THE INVENTION
This paper proposes an end-to-end target detection network with multi-scale feature fusion to address the small size and frequent occlusion of pedestrian targets in pedestrian detection.
BACKGROUND OF THE INVENTION
Pedestrian detection is one of the essential tasks of unmanned driving systems, monitoring systems, and early-warning protection systems, and it plays an important role in many fields. With the development of science and technology in China, high-speed EMUs and new types of automobiles have gradually appeared on the roads, alongside increasing population density and a growing number of vehicles. However, many drivers do not obey the rules, and the severe traffic situation inevitably poses a significant threat to the personal safety of pedestrians [1]-[4]. At present, while driving, the driver must judge whether a pedestrian is ahead in order to take emergency measures. However, pedestrian targets are often small and occluded, and on curved roads the driver's line of sight is more easily obstructed, so it is difficult to guarantee pedestrian safety. It is therefore essential to develop a theory and method that can accurately and quickly detect pedestrians on the road; such a method can improve automobile safety and protect the life and property of the public, which is of great practical significance.

In response to the above issue, we propose an end-to-end target detection network for pedestrian detection in in-vehicle scenarios, based on an improved YOLOv3 network. By incorporating multi-scale features, the network enhances the expression of small-target features, the robustness of pedestrian detection in complex environments, and pedestrian detection accuracy while maintaining real-time detection. The main contributions of this paper are as follows:

(1) We improved YOLOv3 and proposed an end-to-end target detection network incorporating multi-scale features. We encode the semantic dependence between space and channels and fuse shallower spatial features with deep semantic features. This yields a significantly improved detection accuracy for small targets with essentially no increase in the amount of computation.

(2) We validated the improved network on a large-scale data set covering various environments and pedestrian types. We used the mosaic data augmentation method to further enrich the training samples without increasing the training time. On the detection side, K-means clustering was used to obtain anchor box parameters that are easier to learn, so that the location of the target area can be predicted more accurately in the subsequent regression calculation; a minimal clustering sketch is given below.
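As a minimal, hypothetical sketch of the anchor clustering step mentioned in contribution (2), the following applies standard K-means to the normalized widths and heights of the training boxes. Note that YOLO-style implementations commonly cluster with an IoU-based distance (1 - IoU) rather than the Euclidean distance used here; the value k = 9 matches this document, while the function name, array shapes, and distance choice are illustrative assumptions rather than the patent's actual code.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster normalized box (width, height) pairs into k anchor sizes.

    wh: array of shape (N, 2) with widths and heights in [0, 1].
    Returns k anchors sorted by area, smallest to largest.
    """
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the nearest anchor (Euclidean distance here;
        # YOLO-style implementations often use 1 - IoU instead).
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        new_centers = np.array([
            wh[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers.prod(axis=1))]

# Example: wh would be loaded from the YOLO-format label files (columns 4 and 5);
# the 9 anchors are then split 3-3-3 across the 13x13, 26x26, and 52x52 scales.
```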
SUMMARY OF THE INVENTION
Network backbone: The YOLOv3 network is a single-stage target detection method. Unlike the target detection frameworks of the R-CNN series, YOLOv3 does not generate candidate boxes; it regresses the location of the bounding box and its category directly at the output layer [19]. YOLOv3 draws on the ideas of ResNet and the FPN network to add cross-layer skip connections, combining coarse- and fine-grained features to better support the detection task [24], [25]. It adds multi-scale prediction, that is, predictions at three different feature scales, with three anchor boxes per scale. The anchor boxes are designed by clustering: 9 cluster centres are obtained and divided equally, by size, among the three feature layers, whose dimensions are 13 x 13, 26 x 26, and 52 x 52. The feature extraction network of YOLOv3 is Darknet-53, whose structure is shown in Table 3.1. "Convolutional" in the Darknet-53 network denotes a CBL operation, consisting of a convolutional layer, a batch normalization (BN) layer, and a LeakyReLU activation function. In YOLOv3, the BN layer and the LeakyReLU are inseparable from the convolutional layer, and together they form the basic building block. In addition, the network contains Resn residual modules; the numbers 1, 2, 8, 8, 4 in Table 3.1 give the number of residual units in each stage. Darknet-53 has a deeper network structure and a processing speed of 78 images per second, slower than Darknet-19 but about twice as fast as ResNet-152 at comparable accuracy, so Darknet-53 is a feature extraction architecture that balances speed and precision [19].
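To make the CBL building block and the residual unit described above concrete, the following is a minimal PyTorch sketch (class and argument names are illustrative assumptions, not the patent's code). A CBL block is a convolution followed by batch normalization and LeakyReLU, and a Darknet residual unit stacks a 1x1 CBL that halves the channels with a 3x3 CBL that restores them, plus a skip connection.

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv + BatchNorm + LeakyReLU, the basic unit of Darknet-53."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResidualUnit(nn.Module):
    """Darknet residual unit: 1x1 CBL (halve channels) + 3x3 CBL + skip."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            CBL(channels, channels // 2, kernel_size=1),
            CBL(channels // 2, channels, kernel_size=3),
        )

    def forward(self, x):
        return x + self.block(x)

# Example: the first Darknet-53 stage for a 416x416 input.
stem = nn.Sequential(CBL(3, 32), CBL(32, 64, stride=2), ResidualUnit(64))
out = stem(torch.randn(1, 3, 416, 416))   # -> torch.Size([1, 64, 208, 208])
```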
Table 3.1 The feature extraction network of YOLOv3 (Darknet-53)

Layer                                     Filters / Size / Stride    Repeat    Output size
Image                                     -                          -         416 x 416
Conv                                      32, 3x3 / 1                1         416 x 416
Conv                                      64, 3x3 / 2                1         208 x 208
Residual (Conv 32 1x1 + Conv 64 3x3)      -                          1         208 x 208
Conv                                      128, 3x3 / 2               1         104 x 104
Residual (Conv 64 1x1 + Conv 128 3x3)     -                          2         104 x 104
Conv                                      256, 3x3 / 2               1         52 x 52
Residual (Conv 128 1x1 + Conv 256 3x3)    -                          8         52 x 52
Conv                                      512, 3x3 / 2               1         26 x 26
Residual (Conv 256 1x1 + Conv 512 3x3)    -                          8         26 x 26
Conv                                      1024, 3x3 / 2              1         13 x 13
Residual (Conv 512 1x1 + Conv 1024 3x3)   -                          4         13 x 13

Multi-layer feature fusion: Feature fusion integrates features of different types and scales and removes redundant information to obtain a better feature representation [26]. Fusion in neural networks is generally performed in one of two ways, Add or Concatenate. The Add method sums feature maps element-wise, so that each layer describes more information about the image features: the dimensionality of the representation does not increase, but the amount of information per dimension does, which benefits image classification tasks. Concatenate joins feature maps along the channel dimension, i.e., the number of features describing the image increases while the information under each individual feature does not. Our algorithm uses Concatenate to fuse features. The shallow features extracted by a neural network have high resolution and capture the spatial detail of the image, while deep features have lower resolution but capture better semantic information. In order to better combine the two, we modify the network structure and fuse deep features with shallower features.
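The difference between Add and Concatenate fusion, and the concatenation of an upsampled deep feature map with a shallower one used here, can be illustrated with the following minimal PyTorch sketch (tensor shapes and variable names are illustrative assumptions, not the patent's code):

```python
import torch
import torch.nn.functional as F

# A shallow, high-resolution map and a deep, low-resolution map.
shallow = torch.randn(1, 256, 52, 52)   # spatial detail
deep    = torch.randn(1, 256, 13, 13)   # semantic information

# Bring the deep map to the shallow map's resolution, as YOLOv3-style
# necks do before fusing across scales.
deep_up = F.interpolate(deep, size=(52, 52), mode="nearest")

# Add fusion: element-wise sum; the channel count stays at 256, only the
# amount of information carried per channel increases.
added = shallow + deep_up                      # -> (1, 256, 52, 52)

# Concatenate fusion (used by this algorithm): stack along the channel axis,
# so the number of feature channels grows while each channel is unchanged.
fused = torch.cat([shallow, deep_up], dim=1)   # -> (1, 512, 52, 52)

# A 1x1 convolution would typically follow to mix and compress the channels.
```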
Loss function: The loss function of the algorithm consists of three parts: the positioning error of the bounding box, the confidence error of the bounding box, and the classification error of the bounding box.

Positioning error of the bounding box: The positioning error mainly includes the centre-coordinate error and the width-height coordinate error. When the j-th anchor box of the i-th grid cell is responsible for a real target, the box generated by this anchor box is compared with the ground-truth box, and the centre-coordinate error and the width-height error are calculated. The indicator $I_{ij}^{obj}$ denotes whether the j-th anchor box of the i-th grid cell is responsible for this object: if it is responsible, $I_{ij}^{obj} = 1$; otherwise $I_{ij}^{obj} = 0$. The positioning error is

$$L_{coord} = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right]$$
Confidence error of the bounding box: Confidence indicates how certain the network is that the box actually contains an object and that the box covers all the characteristics of the entire object. The confidence error is calculated regardless of whether the anchor box is responsible for a target, and it is expressed as a cross-entropy. The parameter $\hat{C}_i$ represents the ground-truth value, which is determined by whether the bounding box of the grid cell is responsible for predicting an object: if it is responsible, $\hat{C}_i = 1$; otherwise $\hat{C}_i = 0$. The confidence error is

$$L_{conf} = -\sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \hat{C}_i \log(C_i) + (1 - \hat{C}_i) \log(1 - C_i) \right] - \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{noobj} \left[ \hat{C}_i \log(C_i) + (1 - \hat{C}_i) \log(1 - C_i) \right]$$

It should be noted that this loss is divided into two parts, one for cells that contain objects and one for cells that do not, and the no-object part carries an additional weighting factor. The reason for the weighting factor is that, in a typical image, most of the content does not contain objects to be detected, so the no-object cells would otherwise contribute far more to the loss than the object cells, which would bias the network towards predicting that cells contain no objects. The contribution of the no-object part is therefore reduced, for example with $\lambda_{noobj} = 0.5$.

Classification error of the bounding box: The classification error also uses cross-entropy as its loss function. When the j-th anchor box of the i-th grid cell is responsible for a real target, the bounding box generated by that anchor box contributes to the classification loss:
$$L_{cls} = -\sum_{i=0}^{S^2} I_{ij}^{obj} \sum_{c \in classes} \left[ \hat{P}_i(c) \log(P_i(c)) + (1 - \hat{P}_i(c)) \log(1 - P_i(c)) \right]$$

The total loss function of YOLOv3 is obtained by combining the three parts:
$$\begin{aligned} Loss = {} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\ & + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right] \\ & - \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \hat{C}_i \log(C_i) + (1 - \hat{C}_i) \log(1 - C_i) \right] \\ & - \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{noobj} \left[ \hat{C}_i \log(C_i) + (1 - \hat{C}_i) \log(1 - C_i) \right] \\ & - \sum_{i=0}^{S^2} I_{ij}^{obj} \sum_{c \in classes} \left[ \hat{P}_i(c) \log(P_i(c)) + (1 - \hat{P}_i(c)) \log(1 - P_i(c)) \right] \end{aligned}$$
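As a hedged illustration of how such a three-part loss can be assembled, the following is a minimal sketch assuming already-decoded predictions and targets on a single scale; the tensor layout, function name, and weighting values are assumptions rather than the patent's implementation.

```python
import torch
import torch.nn.functional as F

def yolo_v3_style_loss(pred_box, pred_conf, pred_cls,
                       true_box, true_conf, true_cls,
                       obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Three-part loss: box regression + objectness + classification.

    pred_box / true_box:   (N, 4) decoded (x, y, w, h), normalized to [0, 1]
    pred_conf / true_conf: (N,) objectness probabilities / 0-1 targets
    pred_cls / true_cls:   (N, num_classes) class probabilities / 0-1 targets
    obj_mask:              (N,) bool, True where an anchor is responsible
    """
    noobj_mask = ~obj_mask

    # Positioning error: squared error on centres, squared error on the
    # square roots of width and height, only for responsible anchors.
    xy_loss = ((pred_box[obj_mask, :2] - true_box[obj_mask, :2]) ** 2).sum()
    wh_loss = ((pred_box[obj_mask, 2:].clamp(min=1e-6).sqrt()
                - true_box[obj_mask, 2:].sqrt()) ** 2).sum()
    coord_loss = lambda_coord * (xy_loss + wh_loss)

    # Confidence error: binary cross-entropy, with the no-object part
    # down-weighted so that empty cells do not dominate the gradient.
    conf_obj = F.binary_cross_entropy(pred_conf[obj_mask],
                                      true_conf[obj_mask], reduction="sum")
    conf_noobj = F.binary_cross_entropy(pred_conf[noobj_mask],
                                        true_conf[noobj_mask], reduction="sum")
    conf_loss = conf_obj + lambda_noobj * conf_noobj

    # Classification error: binary cross-entropy per class, responsible anchors only.
    cls_loss = F.binary_cross_entropy(pred_cls[obj_mask],
                                      true_cls[obj_mask], reduction="sum")

    return coord_loss + conf_loss + cls_loss
```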
DESCRIPTION OF DRAWINGS 1. Figure 1 shows the network backbone of the convolutional neural network. 2. Figure 2 shows the visual detection results for the INRIA data set. 3. Figure 3 shows the visual detection results for the KITTI data set.
DESCRIPTION OF PREFERRED EMBODIMENT
Implementation details: In this experiment, we use the PyTorch framework to implement the algorithm, and the data sets are the INRIA and KITTI pedestrian detection data sets. The INRIA data set includes 614 training images and 288 validation images. Each image in the KITTI data set contains up to 15 cars and 30 pedestrians, with various degrees of occlusion and truncation; we selected 3500 images from the KITTI data set, of which 3000 are used for training and 500 for validation. The purpose of this experiment is to detect the pedestrians in the data sets, compare the detections with the validation annotations, and obtain high accuracy in the process.

In the first step, we use the git clone instruction to obtain the corresponding YOLOv3 target detection code base from GitHub. We then configure the environment, choosing Python 3.8 for compatibility with the different components of the program, and install the libraries needed for target detection one by one, such as the Matplotlib library for scientific plotting and the OpenCV library for computer vision.

In the second step, we prepare the INRIA and KITTI data sets for detection. Before detection, the data must be preprocessed: the PNG-format annotation files in the INRIA data set are converted into the TXT format that YOLOv3 can read. From these we generate the data file needed for training, which tells the framework the classes to be detected.

The third step is to prepare the network model configuration file, in CFG format. The original code detects 80 classes, but pedestrian detection only concerns the person class, so we change the number of classes in each YOLO layer to 1. The filters value of the convolutional layer immediately before each YOLO layer is then changed to 18, since filters = 3 x (number of classes + 5) = 3 x 6 = 18; because YOLOv3 has three output layers, all three of these filters values are changed to 18.

In the fourth step, we design the training process. The algorithm is trained for 100 epochs with a batch size of 64. We adopt multi-scale training, randomly adjusting the size of the input images, which improves the robustness of the model. Finally, we cut the images to a standard rectangle, which reduces the amount of computation.

The last step is writing the detection code, which verifies the pedestrian detection model trained in the steps above. First, the CFG-format network model is configured and a file source is provided for detection. Because YOLOv3 performs five downsampling operations, the input image size must be set to a multiple of 32. Detection can then be run and the results observed by setting the relevant options, such as device and save-txt.

Data processing: In the second step above, the data must be preprocessed, i.e., the PNG-format annotation files in the INRIA data set are converted into the TXT format recognized by YOLOv3. In the TXT file, each line represents one target, and each target corresponds to five values. From left to right, they are the detection category, the abscissa of the centre point as a proportion of the image width, the ordinate of the centre point as a proportion of the image height, the ratio of the detection box width to the image width, and the ratio of the detection box height to the image height. All five values lie in the range 0-1 because they are normalized. After these steps, the data file needed for training can be generated, and it tells the framework the classes to be detected. A minimal conversion sketch follows.
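A minimal sketch of this conversion, under the assumption that the source annotation gives pixel-coordinate corners of each pedestrian box (the helper name and input format are illustrative, not the patent's actual tooling):

```python
def to_yolo_txt_line(cls_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-coordinate box into one normalized YOLO TXT line.

    The five values are: class id, centre x / image width, centre y / image
    height, box width / image width, box height / image height.
    """
    cx = (x_min + x_max) / 2.0 / img_w
    cy = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{cls_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Example: a pedestrian box (class 0) of 120x300 pixels in a 640x480 image.
print(to_yolo_txt_line(0, 100, 50, 220, 350, 640, 480))
# -> "0 0.250000 0.416667 0.187500 0.625000"
```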
Experimental results: We run detection on the INRIA and KITTI data sets, respectively. The visual detection results for the INRIA and KITTI data sets are shown in Figure 2 and Figure 3, respectively. As can be seen from these figures, after training there is a detection box for each pedestrian; this box approximately determines the location of the pedestrian, and the confidence value at the top of the box, a number in the range 0-1, indicates the probability that the detection area contains a pedestrian. Moreover, after improving the backbone structure of YOLOv3 and training for 20 iterations, we input 512 x 512 images and compare the accuracy on the validation set; the accuracy of the new model is improved to a certain extent.

Figure 4.3 shows the values of Precision, mAP@0.5, Recall, and F1 on the INRIA data set. Among these four values, Precision indicates the proportion of samples that the network considers positive which are actually positive, so it directly reflects the accuracy of the target detection network. Recall indicates the proportion of the real positive samples that the network identifies, which also reflects the accuracy of detection to some extent. F1 is computed from Recall and Precision and is their harmonic mean. Finally, mAP@0.5 also reflects the accuracy of detection to a certain extent; a minimal computation sketch of these metrics is given below. In Figure 4.3, the values of Precision, mAP@0.5, and F1 are improved while Recall decreases slightly, which shows that the change to the backbone network is effective. However, mAP@0.5 did not improve significantly, rising only from 0.915 to 0.921. This is because there are few small targets in the INRIA data set, whereas our change mainly fuses shallower features, introducing more low-level spatial information and improving the accuracy of small-target detection in particular, so the effect is not strongly reflected in mAP@0.5 on INRIA.

Figure 4.4 shows the changes of Precision, mAP@0.5, Recall, and F1 on the KITTI data set. Because there are many more small targets in the KITTI data set, it better reflects the impact of the changes to the YOLOv3 backbone on detection. The experimental results show that all four reference values, Precision, mAP@0.5, Recall, and F1, are improved. In particular, mAP@0.5 increases from 0.18 to 0.427, which shows that the accuracy of the network for small-target detection is greatly improved.
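A minimal sketch of the Precision, Recall, and F1 computation described above, assuming detections have already been matched to ground-truth boxes (for example at the IoU threshold of 0.5 implied by mAP@0.5); the counting of matches from raw boxes and the full mAP integration over a precision-recall curve are omitted, and the counts used in the example are made up for illustration:

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Precision, Recall, and their harmonic mean F1 from matched detections."""
    precision = true_positives / max(true_positives + false_positives, 1)
    recall = true_positives / max(true_positives + false_negatives, 1)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1

# Example with made-up counts: 85 correct detections, 15 false alarms,
# 10 missed pedestrians.
p, r, f1 = precision_recall_f1(85, 15, 10)
print(f"Precision={p:.3f} Recall={r:.3f} F1={f1:.3f}")
# -> Precision=0.850 Recall=0.895 F1=0.872
```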
Figure 4.3 Results on the INRIA data set

Index              Precision   Recall   mAP@0.5   F1
YOLOv3             0.834       0.891    0.915     0.862
Modified YOLOv3    0.889       0.876    0.921     0.882

Figure 4.4 Results on the KITTI data set

Index              Precision   Recall   mAP@0.5   F1
YOLOv3             0.447       0.173    0.18      0.25
Modified YOLOv3    0.616       0.425    0.427     0.503

Conclusion: This work is based on an improvement of the YOLOv3 algorithm; an end-to-end multi-scale feature fusion target detection network is proposed for target detection in pedestrian scenes. The algorithm encodes the semantic dependence between space and channels, fuses multi-scale features, and significantly improves the accuracy of small-target detection without increasing the amount of computation. In addition, we validate the algorithm on a large-scale data set covering a variety of environments and pedestrian types, use the mosaic data augmentation method to further enrich the training samples without increasing the training time, and obtain more suitable anchor boxes through K-means clustering at the detection end, so that the location of the target area can be predicted more accurately in the subsequent regression calculation. In the experimental results, the pedestrian detection accuracy on the INRIA data set is slightly improved, while the pedestrian detection accuracy on the KITTI data set, which contains many small targets and a more complex environment, is significantly improved. This proves that the network enhances the expression of small-target features, improves the robustness of pedestrian detection in complex environments, and substantially improves pedestrian detection accuracy.

Claims (1)

CLAIM
1. Method of pedestrian detection based on multi-layer feature fusion, characterized in that it comprises the following steps: implementing the algorithm using the PyTorch framework, with the INRIA and KITTI pedestrian detection data sets as the data sets.
AU2021104243A 2021-07-16 2021-07-16 Method of Pedestrian detection based on multi-layer feature fusion Ceased AU2021104243A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2021104243A AU2021104243A4 (en) 2021-07-16 2021-07-16 Method of Pedestrian detection based on multi-layer feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2021104243A AU2021104243A4 (en) 2021-07-16 2021-07-16 Method of Pedestrian detection based on multi-layer feature fusion

Publications (1)

Publication Number Publication Date
AU2021104243A4 true AU2021104243A4 (en) 2021-09-09

Family

ID=77589174

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2021104243A Ceased AU2021104243A4 (en) 2021-07-16 2021-07-16 Method of Pedestrian detection based on multi-layer feature fusion

Country Status (1)

Country Link
AU (1) AU2021104243A4 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841931A (en) * 2022-04-18 2022-08-02 西南交通大学 Real-time sleeper defect detection method based on pruning algorithm
CN115082688A (en) * 2022-06-02 2022-09-20 艾迪恩(山东)科技有限公司 Multi-scale feature fusion method based on target detection
CN117173748A (en) * 2023-11-03 2023-12-05 杭州登虹科技有限公司 Video humanoid event extraction system based on humanoid recognition and humanoid detection
CN117173748B (en) * 2023-11-03 2024-01-26 杭州登虹科技有限公司 Video humanoid event extraction system based on humanoid recognition and humanoid detection

Similar Documents

Publication Publication Date Title
AU2021104243A4 (en) Method of Pedestrian detection based on multi-layer feature fusion
Wang et al. A vision-based video crash detection framework for mixed traffic flow environment considering low-visibility condition
Lin et al. Helmet use detection of tracked motorcycles using cnn-based multi-task learning
Zhang et al. Prediction of pedestrian-vehicle conflicts at signalized intersections based on long short-term memory neural network
CN106372571A (en) Road traffic sign detection and identification method
CN106682092A (en) Target retrieval method and terminal
CN113657299A (en) Traffic accident determination method and electronic equipment
Park et al. Urban traffic accident risk prediction for knowledge-based mobile multimedia service
CN114419583A (en) Yolov4-tiny target detection algorithm with large-scale features
Deva Hema et al. Novel algorithm for multivariate time series crash risk prediction using CNN-ATT-LSTM model
Rahman et al. Predicting driver behaviour at intersections based on driver gaze and traffic light recognition
Can et al. Vehicle detection and counting under mixed traffic conditions in vietnam using yolov4
Chuanxia et al. Machine learning and IoTs for forecasting prediction of smart road traffic flow
US20230196772A1 (en) Query-oriented event recognition system and method
CN112215188A (en) Traffic police gesture recognition method, device, equipment and storage medium
CN110555425A (en) Video stream real-time pedestrian detection method
CN116304986A (en) Vehicle event fusion method, device, equipment and readable storage medium
CN115496978A (en) Image and vehicle speed information fused driving behavior classification method and device
CN111126271B (en) Bayonet snap image vehicle detection method, computer storage medium and electronic equipment
Deng et al. Pedestrian detection based on multi-layer feature fusion
CN114881096A (en) Multi-label class balancing method and device
Li et al. A Deep Multichannel Network Model for Driving Behavior Risk Classification
CN116541715B (en) Target detection method, training method of model, target detection system and device
Alam et al. Deep Learning Envisioned Accident Detection System
CN112837326B (en) Method, device and equipment for detecting carryover

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry