AU2021104243A4 - Method of Pedestrian detection based on multi-layer feature fusion - Google Patents

Method of Pedestrian detection based on multi-layer feature fusion

Info

Publication number
AU2021104243A4
AU2021104243A4
Authority
AU
Australia
Prior art keywords
detection
pedestrian
network
pedestrian detection
yolov3
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2021104243A
Inventor
Ziteng Li
Tianni Mei
Wen SUN
Congyao Wang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to AU2021104243A priority Critical patent/AU2021104243A4/en
Application granted granted Critical
Publication of AU2021104243A4 publication Critical patent/AU2021104243A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

This paper proposes an end-to-end target detection network with multi-scale feature fusion to address the small size and frequent occlusion of pedestrian targets in pedestrian detection. The algorithm is based on the YOLOv3 network; it fully integrates multi-scale features, enhances the expression of small-target features, improves the robustness of pedestrian detection in complex environments, and raises pedestrian detection accuracy while preserving real-time detection. In the experiments, the algorithm is compared with current mainstream pedestrian detection algorithms. It effectively improves detection accuracy on the INRIA and KITTI data sets, where the average precision of YOLOv3 on the two data sets is improved by 6% and 24.7%, respectively.

Description

TITLE Method of Pedestrian detection based on multi-layer feature fusion
FIELD OF THE INVENTION
This paper proposes an end-to-end target detection network with multi-scale feature fusion to address the small size and frequent occlusion of pedestrian targets in pedestrian detection.
BACKGROUND OF THE INVENTION
Pedestrian detection is one of the essential tasks of unmanned driving systems, monitoring systems, and early-warning protection systems, and it plays an important role in many fields. With the development of science and technology in China, high-speed EMUs and new types of automobiles have gradually appeared on the roads, alongside increasing population density and a growing number of vehicles. However, many drivers do not obey the rules, and the severe traffic situation inevitably poses a significant threat to the personal safety of pedestrians [1]-[4]. At present, while driving, the driver must judge whether a pedestrian is ahead in order to take emergency measures. However, pedestrian targets are often small and occluded, and on curved roads the driver's line of sight is more easily obstructed, so it is difficult to guarantee pedestrian safety. It is therefore essential to develop a theory and method that can accurately and quickly detect pedestrians on the road; such a method can improve automobile safety and protect the life and property of the public, which is of great practical significance.

In response to the above issue, we propose an end-to-end target detection network for pedestrian detection in in-vehicle scenarios, based on an improved YOLOv3 network. By incorporating multi-scale features, the network enhances the expression of small-target features, the robustness of pedestrian detection in complex environments, and pedestrian detection accuracy while maintaining real-time detection. The main contributions of this paper are as follows:

(1) We improved YOLOv3 and proposed an end-to-end target detection network incorporating multi-scale features. We encode the semantic dependence between space and channels and fuse shallower spatial features with deep semantic features. This yields a significantly improved detection accuracy for small targets with essentially no increase in the amount of computation.

(2) We validated the improved network on a large-scale data set covering various environments and pedestrian types. We used the mosaic data augmentation method to further enrich the training samples without increasing the training time. On the detection side, K-means clustering was used to obtain anchor box parameters that are easier to learn, so that the location of the target area can be predicted more accurately in the subsequent regression calculation; a minimal clustering sketch is given below.
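As a minimal, hypothetical sketch of the anchor clustering step mentioned in contribution (2), the following applies standard K-means to the normalized widths and heights of the training boxes. Note that YOLO-style implementations commonly cluster with an IoU-based distance (1 - IoU) rather than the Euclidean distance used here; the value k = 9 matches this document, while the function name, array shapes, and distance choice are illustrative assumptions rather than the patent's actual code.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster normalized box (width, height) pairs into k anchor sizes.

    wh: array of shape (N, 2) with widths and heights in [0, 1].
    Returns k anchors sorted by area, smallest to largest.
    """
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the nearest anchor (Euclidean distance here;
        # YOLO-style implementations often use 1 - IoU instead).
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        new_centers = np.array([
            wh[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers.prod(axis=1))]

# Example: wh would be loaded from the YOLO-format label files (columns 4 and 5);
# the 9 anchors are then split 3-3-3 across the 13x13, 26x26, and 52x52 scales.
```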
SUMMARY OF THE INVENTION
Network backbone: The YOLOv3 network is a single-stage target detection method. Unlike the target detection frameworks of the R-CNN series, YOLOv3 does not generate candidate boxes; it regresses the location of the bounding box and its category directly at the output layer [19]. YOLOv3 draws on the ideas of ResNet and the FPN network to add cross-layer skip connections, combining coarse- and fine-grained features to better support the detection task [24], [25]. It adds multi-scale prediction, that is, predictions at three different feature scales, with three anchor boxes per scale. The anchor boxes are designed by clustering: 9 cluster centres are obtained and divided equally, by size, among the three feature layers, whose dimensions are 13 x 13, 26 x 26, and 52 x 52. The feature extraction network of YOLOv3 is Darknet-53, whose structure is shown in Table 3.1. "Convolutional" in the Darknet-53 network denotes a CBL operation, consisting of a convolutional layer, a batch normalization (BN) layer, and a LeakyReLU activation function. In YOLOv3, the BN layer and the LeakyReLU are inseparable from the convolutional layer, and together they form the basic building block. In addition, the network contains Resn residual modules; the numbers 1, 2, 8, 8, 4 in Table 3.1 give the number of residual units in each stage. Darknet-53 has a deeper network structure and a processing speed of 78 images per second, slower than Darknet-19 but about twice as fast as ResNet-152 at comparable accuracy, so Darknet-53 is a feature extraction architecture that balances speed and precision [19].
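To make the CBL building block and the residual unit described above concrete, the following is a minimal PyTorch sketch (class and argument names are illustrative assumptions, not the patent's code). A CBL block is a convolution followed by batch normalization and LeakyReLU, and a Darknet residual unit stacks a 1x1 CBL that halves the channels with a 3x3 CBL that restores them, plus a skip connection.

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv + BatchNorm + LeakyReLU, the basic unit of Darknet-53."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResidualUnit(nn.Module):
    """Darknet residual unit: 1x1 CBL (halve channels) + 3x3 CBL + skip."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            CBL(channels, channels // 2, kernel_size=1),
            CBL(channels // 2, channels, kernel_size=3),
        )

    def forward(self, x):
        return x + self.block(x)

# Example: the first Darknet-53 stage for a 416x416 input.
stem = nn.Sequential(CBL(3, 32), CBL(32, 64, stride=2), ResidualUnit(64))
out = stem(torch.randn(1, 3, 416, 416))   # -> torch.Size([1, 64, 208, 208])
```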
Table 3.1 The feature extraction network of YOLOv3 (Darknet-53)

Layer                                     Filters / Size / Stride    Repeat    Output size
Image                                     -                          -         416 x 416
Conv                                      32, 3x3 / 1                1         416 x 416
Conv                                      64, 3x3 / 2                1         208 x 208
Residual (Conv 32 1x1 + Conv 64 3x3)      -                          1         208 x 208
Conv                                      128, 3x3 / 2               1         104 x 104
Residual (Conv 64 1x1 + Conv 128 3x3)     -                          2         104 x 104
Conv                                      256, 3x3 / 2               1         52 x 52
Residual (Conv 128 1x1 + Conv 256 3x3)    -                          8         52 x 52
Conv                                      512, 3x3 / 2               1         26 x 26
Residual (Conv 256 1x1 + Conv 512 3x3)    -                          8         26 x 26
Conv                                      1024, 3x3 / 2              1         13 x 13
Residual (Conv 512 1x1 + Conv 1024 3x3)   -                          4         13 x 13

Multi-layer feature fusion: Feature fusion integrates features of different types and scales and removes redundant information to obtain a better feature representation [26]. Fusion in neural networks is generally performed in one of two ways, Add or Concatenate. The Add method sums feature maps element-wise, so that each layer describes more information about the image features: the dimensionality of the representation does not increase, but the amount of information per dimension does, which benefits image classification tasks. Concatenate joins feature maps along the channel dimension, i.e., the number of features describing the image increases while the information under each individual feature does not. Our algorithm uses Concatenate to fuse features. The shallow features extracted by a neural network have high resolution and capture the spatial detail of the image, while deep features have lower resolution but capture better semantic information. In order to better combine the two, we modify the network structure and fuse deep features with shallower features.
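The difference between Add and Concatenate fusion, and the concatenation of an upsampled deep feature map with a shallower one used here, can be illustrated with the following minimal PyTorch sketch (tensor shapes and variable names are illustrative assumptions, not the patent's code):

```python
import torch
import torch.nn.functional as F

# A shallow, high-resolution map and a deep, low-resolution map.
shallow = torch.randn(1, 256, 52, 52)   # spatial detail
deep    = torch.randn(1, 256, 13, 13)   # semantic information

# Bring the deep map to the shallow map's resolution, as YOLOv3-style
# necks do before fusing across scales.
deep_up = F.interpolate(deep, size=(52, 52), mode="nearest")

# Add fusion: element-wise sum; the channel count stays at 256, only the
# amount of information carried per channel increases.
added = shallow + deep_up                      # -> (1, 256, 52, 52)

# Concatenate fusion (used by this algorithm): stack along the channel axis,
# so the number of feature channels grows while each channel is unchanged.
fused = torch.cat([shallow, deep_up], dim=1)   # -> (1, 512, 52, 52)

# A 1x1 convolution would typically follow to mix and compress the channels.
```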
Loss function: The loss function of the algorithm consists of three parts: the positioning error of the bounding box, the confidence error of the bounding box, and the classification error of the bounding box.

Positioning error of the bounding box: The positioning error mainly includes the centre-coordinate error and the width-height coordinate error. When the j-th anchor box of the i-th grid cell is responsible for a real target, the box generated by this anchor box is compared with the ground-truth box, and the centre-coordinate error and the width-height error are calculated. The indicator $I_{ij}^{obj}$ denotes whether the j-th anchor box of the i-th grid cell is responsible for this object: if it is responsible, $I_{ij}^{obj} = 1$; otherwise $I_{ij}^{obj} = 0$. The positioning error is

$$L_{coord} = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right]$$
Confidence error of the bounding box: Confidence indicates how certain the network is that the box actually contains an object and that the box covers all the characteristics of the entire object. The confidence error is calculated regardless of whether the anchor box is responsible for a target, and it is expressed as a cross-entropy. The parameter $\hat{C}_i$ represents the ground-truth value, which is determined by whether the bounding box of the grid cell is responsible for predicting an object: if it is responsible, $\hat{C}_i = 1$; otherwise $\hat{C}_i = 0$. The confidence error is

$$L_{conf} = -\sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \hat{C}_i \log(C_i) + (1 - \hat{C}_i) \log(1 - C_i) \right] - \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{noobj} \left[ \hat{C}_i \log(C_i) + (1 - \hat{C}_i) \log(1 - C_i) \right]$$

It should be noted that this loss is divided into two parts, one for cells that contain objects and one for cells that do not, and the no-object part carries an additional weighting factor. The reason for the weighting factor is that, in a typical image, most of the content does not contain objects to be detected, so the no-object cells would otherwise contribute far more to the loss than the object cells, which would bias the network towards predicting that cells contain no objects. The contribution of the no-object part is therefore reduced, for example with $\lambda_{noobj} = 0.5$.

Classification error of the bounding box: The classification error also uses cross-entropy as its loss function. When the j-th anchor box of the i-th grid cell is responsible for a real target, the bounding box generated by that anchor box contributes to the classification loss:
$$L_{cls} = -\sum_{i=0}^{S^2} I_{ij}^{obj} \sum_{c \in classes} \left[ \hat{P}_i(c) \log(P_i(c)) + (1 - \hat{P}_i(c)) \log(1 - P_i(c)) \right]$$

The total loss function of YOLOv3 is obtained by combining the three parts:
$$\begin{aligned} Loss = {} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\ & + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right] \\ & - \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \hat{C}_i \log(C_i) + (1 - \hat{C}_i) \log(1 - C_i) \right] \\ & - \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{noobj} \left[ \hat{C}_i \log(C_i) + (1 - \hat{C}_i) \log(1 - C_i) \right] \\ & - \sum_{i=0}^{S^2} I_{ij}^{obj} \sum_{c \in classes} \left[ \hat{P}_i(c) \log(P_i(c)) + (1 - \hat{P}_i(c)) \log(1 - P_i(c)) \right] \end{aligned}$$
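As a hedged illustration of how such a three-part loss can be assembled, the following is a minimal sketch assuming already-decoded predictions and targets on a single scale; the tensor layout, function name, and weighting values are assumptions rather than the patent's implementation.

```python
import torch
import torch.nn.functional as F

def yolo_v3_style_loss(pred_box, pred_conf, pred_cls,
                       true_box, true_conf, true_cls,
                       obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Three-part loss: box regression + objectness + classification.

    pred_box / true_box:   (N, 4) decoded (x, y, w, h), normalized to [0, 1]
    pred_conf / true_conf: (N,) objectness probabilities / 0-1 targets
    pred_cls / true_cls:   (N, num_classes) class probabilities / 0-1 targets
    obj_mask:              (N,) bool, True where an anchor is responsible
    """
    noobj_mask = ~obj_mask

    # Positioning error: squared error on centres, squared error on the
    # square roots of width and height, only for responsible anchors.
    xy_loss = ((pred_box[obj_mask, :2] - true_box[obj_mask, :2]) ** 2).sum()
    wh_loss = ((pred_box[obj_mask, 2:].clamp(min=1e-6).sqrt()
                - true_box[obj_mask, 2:].sqrt()) ** 2).sum()
    coord_loss = lambda_coord * (xy_loss + wh_loss)

    # Confidence error: binary cross-entropy, with the no-object part
    # down-weighted so that empty cells do not dominate the gradient.
    conf_obj = F.binary_cross_entropy(pred_conf[obj_mask],
                                      true_conf[obj_mask], reduction="sum")
    conf_noobj = F.binary_cross_entropy(pred_conf[noobj_mask],
                                        true_conf[noobj_mask], reduction="sum")
    conf_loss = conf_obj + lambda_noobj * conf_noobj

    # Classification error: binary cross-entropy per class, responsible anchors only.
    cls_loss = F.binary_cross_entropy(pred_cls[obj_mask],
                                      true_cls[obj_mask], reduction="sum")

    return coord_loss + conf_loss + cls_loss
```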
DESCRIPTION OF DRAWINGS 1. Figure 1 shows the network backbone of the convolutional neural network. 2. Figure 2 shows the visual detection results for the INRIA data set. 3. Figure 3 shows the visual detection results for the KITTI data set.
DESCRIPTION OF PREFERRED EMBODIMENT
Implementation details: In this experiment, we use the PyTorch framework to implement the algorithm, and the data sets are the INRIA and KITTI pedestrian detection data sets. The INRIA data set includes 614 training images and 288 validation images. Each image in the KITTI data set contains up to 15 cars and 30 pedestrians, with various degrees of occlusion and truncation; we selected 3500 images from the KITTI data set, of which 3000 are used for training and 500 for validation. The purpose of this experiment is to detect the pedestrians in the data sets, compare the detections with the validation annotations, and obtain high accuracy in the process.

In the first step, we use the git clone instruction to obtain the corresponding YOLOv3 target detection code base from GitHub. We then configure the environment, choosing Python 3.8 for compatibility with the different components of the program, and install the libraries needed for target detection one by one, such as the Matplotlib library for scientific plotting and the OpenCV library for computer vision.

In the second step, we prepare the INRIA and KITTI data sets for detection. Before detection, the data must be preprocessed: the PNG-format annotation files in the INRIA data set are converted into the TXT format that YOLOv3 can read. From these we generate the data file needed for training, which tells the framework the classes to be detected.

The third step is to prepare the network model configuration file, in CFG format. The original code detects 80 classes, but pedestrian detection only concerns the person class, so we change the number of classes in each YOLO layer to 1. The filters value of the convolutional layer immediately before each YOLO layer is then changed to 18, since filters = 3 x (number of classes + 5) = 3 x 6 = 18; because YOLOv3 has three output layers, all three of these filters values are changed to 18.

In the fourth step, we design the training process. The algorithm is trained for 100 epochs with a batch size of 64. We adopt multi-scale training, randomly adjusting the size of the input images, which improves the robustness of the model. Finally, we cut the images to a standard rectangle, which reduces the amount of computation.

The last step is writing the detection code, which verifies the pedestrian detection model trained in the steps above. First, the CFG-format network model is configured and a file source is provided for detection. Because YOLOv3 performs five downsampling operations, the input image size must be set to a multiple of 32. Detection can then be run and the results observed by setting the relevant options, such as device and save-txt.

Data processing: In the second step above, the data must be preprocessed, i.e., the PNG-format annotation files in the INRIA data set are converted into the TXT format recognized by YOLOv3. In the TXT file, each line represents one target, and each target corresponds to five values. From left to right, they are the detection category, the abscissa of the centre point as a proportion of the image width, the ordinate of the centre point as a proportion of the image height, the ratio of the detection box width to the image width, and the ratio of the detection box height to the image height. All five values lie in the range 0-1 because they are normalized. After these steps, the data file needed for training can be generated, and it tells the framework the classes to be detected. A minimal conversion sketch follows.
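A minimal sketch of this conversion, under the assumption that the source annotation gives pixel-coordinate corners of each pedestrian box (the helper name and input format are illustrative, not the patent's actual tooling):

```python
def to_yolo_txt_line(cls_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-coordinate box into one normalized YOLO TXT line.

    The five values are: class id, centre x / image width, centre y / image
    height, box width / image width, box height / image height.
    """
    cx = (x_min + x_max) / 2.0 / img_w
    cy = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{cls_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Example: a pedestrian box (class 0) of 120x300 pixels in a 640x480 image.
print(to_yolo_txt_line(0, 100, 50, 220, 350, 640, 480))
# -> "0 0.250000 0.416667 0.187500 0.625000"
```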
Experimental results: We run detection on the INRIA and KITTI data sets, respectively. The visual detection results for the INRIA and KITTI data sets are shown in Figure 2 and Figure 3, respectively. As can be seen from these figures, after training there is a detection box for each pedestrian; this box approximately determines the location of the pedestrian, and the confidence value at the top of the box, a number in the range 0-1, indicates the probability that the detection area contains a pedestrian. Moreover, after improving the backbone structure of YOLOv3 and training for 20 iterations, we input 512 x 512 images and compare the accuracy on the validation set; the accuracy of the new model is improved to a certain extent.

Figure 4.3 shows the values of Precision, mAP@0.5, Recall, and F1 on the INRIA data set. Among these four values, Precision indicates the proportion of samples that the network considers positive which are actually positive, so it directly reflects the accuracy of the target detection network. Recall indicates the proportion of the real positive samples that the network identifies, which also reflects the accuracy of detection to some extent. F1 is computed from Recall and Precision and is their harmonic mean. Finally, mAP@0.5 also reflects the accuracy of detection to a certain extent; a minimal computation sketch of these metrics is given below. In Figure 4.3, the values of Precision, mAP@0.5, and F1 are improved while Recall decreases slightly, which shows that the change to the backbone network is effective. However, mAP@0.5 did not improve significantly, rising only from 0.915 to 0.921. This is because there are few small targets in the INRIA data set, whereas our change mainly fuses shallower features, introducing more low-level spatial information and improving the accuracy of small-target detection in particular, so the effect is not strongly reflected in mAP@0.5 on INRIA.

Figure 4.4 shows the changes of Precision, mAP@0.5, Recall, and F1 on the KITTI data set. Because there are many more small targets in the KITTI data set, it better reflects the impact of the changes to the YOLOv3 backbone on detection. The experimental results show that all four reference values, Precision, mAP@0.5, Recall, and F1, are improved. In particular, mAP@0.5 increases from 0.18 to 0.427, which shows that the accuracy of the network for small-target detection is greatly improved.
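A minimal sketch of the Precision, Recall, and F1 computation described above, assuming detections have already been matched to ground-truth boxes (for example at the IoU threshold of 0.5 implied by mAP@0.5); the counting of matches from raw boxes and the full mAP integration over a precision-recall curve are omitted, and the counts used in the example are made up for illustration:

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Precision, Recall, and their harmonic mean F1 from matched detections."""
    precision = true_positives / max(true_positives + false_positives, 1)
    recall = true_positives / max(true_positives + false_negatives, 1)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1

# Example with made-up counts: 85 correct detections, 15 false alarms,
# 10 missed pedestrians.
p, r, f1 = precision_recall_f1(85, 15, 10)
print(f"Precision={p:.3f} Recall={r:.3f} F1={f1:.3f}")
# -> Precision=0.850 Recall=0.895 F1=0.872
```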
Figure 4.3 Results on the INRIA data set

Index              Precision   Recall   mAP@0.5   F1
YOLOv3             0.834       0.891    0.915     0.862
Modified YOLOv3    0.889       0.876    0.921     0.882

Figure 4.4 Results on the KITTI data set

Index              Precision   Recall   mAP@0.5   F1
YOLOv3             0.447       0.173    0.18      0.25
Modified YOLOv3    0.616       0.425    0.427     0.503

Conclusion: This work is based on an improvement of the YOLOv3 algorithm; an end-to-end multi-scale feature fusion target detection network is proposed for target detection in pedestrian scenes. The algorithm encodes the semantic dependence between space and channels, fuses multi-scale features, and significantly improves the accuracy of small-target detection without increasing the amount of computation. In addition, we validate the algorithm on a large-scale data set covering a variety of environments and pedestrian types, use the mosaic data augmentation method to further enrich the training samples without increasing the training time, and obtain more suitable anchor boxes through K-means clustering at the detection end, so that the location of the target area can be predicted more accurately in the subsequent regression calculation. In the experimental results, the pedestrian detection accuracy on the INRIA data set is slightly improved, while the pedestrian detection accuracy on the KITTI data set, which contains many small targets and a more complex environment, is significantly improved. This proves that the network enhances the expression of small-target features, improves the robustness of pedestrian detection in complex environments, and substantially improves pedestrian detection accuracy.

Claims (1)

CLAIM
1. Method of pedestrian detection based on multi-layer feature fusion, characterized in that it comprises the following steps: implementing the algorithm using the PyTorch framework, with the INRIA and KITTI pedestrian detection data sets as the data sets.
AU2021104243A 2021-07-16 2021-07-16 Method of Pedestrian detection based on multi-layer feature fusion Ceased AU2021104243A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2021104243A AU2021104243A4 (en) 2021-07-16 2021-07-16 Method of Pedestrian detection based on multi-layer feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2021104243A AU2021104243A4 (en) 2021-07-16 2021-07-16 Method of Pedestrian detection based on multi-layer feature fusion

Publications (1)

Publication Number Publication Date
AU2021104243A4 true AU2021104243A4 (en) 2021-09-09

Family

ID=77589174

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2021104243A Ceased AU2021104243A4 (en) 2021-07-16 2021-07-16 Method of Pedestrian detection based on multi-layer feature fusion

Country Status (1)

Country Link
AU (1) AU2021104243A4 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841931A (en) * 2022-04-18 2022-08-02 西南交通大学 Real-time sleeper defect detection method based on pruning algorithm
CN115082688A (en) * 2022-06-02 2022-09-20 艾迪恩(山东)科技有限公司 Multi-scale feature fusion method based on target detection
CN117173748A (en) * 2023-11-03 2023-12-05 杭州登虹科技有限公司 Video humanoid event extraction system based on humanoid recognition and humanoid detection
CN117173748B (en) * 2023-11-03 2024-01-26 杭州登虹科技有限公司 Video humanoid event extraction system based on humanoid recognition and humanoid detection

Similar Documents

Publication Publication Date Title
AU2021104243A4 (en) Method of Pedestrian detection based on multi-layer feature fusion
Wang et al. A vision-based video crash detection framework for mixed traffic flow environment considering low-visibility condition
Lin et al. Helmet use detection of tracked motorcycles using cnn-based multi-task learning
Zhang et al. Prediction of pedestrian-vehicle conflicts at signalized intersections based on long short-term memory neural network
CN106372571A (en) Road traffic sign detection and identification method
CN106682092A (en) Target retrieval method and terminal
CN113657299A (en) Traffic accident determination method and electronic equipment
Park et al. Urban traffic accident risk prediction for knowledge-based mobile multimedia service
CN114419583A (en) Yolov4-tiny target detection algorithm with large-scale features
Deva Hema et al. Novel algorithm for multivariate time series crash risk prediction using CNN-ATT-LSTM model
Rahman et al. Predicting driver behaviour at intersections based on driver gaze and traffic light recognition
Can et al. Vehicle detection and counting under mixed traffic conditions in vietnam using yolov4
Chuanxia et al. Machine learning and IoTs for forecasting prediction of smart road traffic flow
US20230196772A1 (en) Query-oriented event recognition system and method
CN112215188A (en) Traffic police gesture recognition method, device, equipment and storage medium
CN110555425A (en) Video stream real-time pedestrian detection method
CN116304986A (en) Vehicle event fusion method, device, equipment and readable storage medium
CN115496978A (en) Image and vehicle speed information fused driving behavior classification method and device
CN111126271B (en) Bayonet snap image vehicle detection method, computer storage medium and electronic equipment
Deng et al. Pedestrian detection based on multi-layer feature fusion
CN114881096A (en) Multi-label class balancing method and device
Li et al. A Deep Multichannel Network Model for Driving Behavior Risk Classification
CN116541715B (en) Target detection method, training method of model, target detection system and device
Alam et al. Deep Learning Envisioned Accident Detection System
CN112837326B (en) Method, device and equipment for detecting carryover

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry