CN112733749A - Real-time pedestrian detection method integrating attention mechanism - Google Patents

Real-time pedestrian detection method integrating attention mechanism

Info

Publication number
CN112733749A
CN112733749A CN202110049426.6A CN202110049426A
Authority
CN
China
Prior art keywords
network
pedestrian
detection
attention
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110049426.6A
Other languages
Chinese (zh)
Other versions
CN112733749B (en)
Inventor
冯宇平
管玉宇
刘宁
杨旭睿
赵文仓
王明甲
刘雪峰
秦浩华
王兆辉
赵德钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao University of Science and Technology
Original Assignee
Qingdao University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao University of Science and Technology filed Critical Qingdao University of Science and Technology
Priority to CN202110049426.6A priority Critical patent/CN112733749B/en
Publication of CN112733749A publication Critical patent/CN112733749A/en
Application granted granted Critical
Publication of CN112733749B publication Critical patent/CN112733749B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Abstract

The invention relates to a real-time pedestrian detection method integrating an attention mechanism, and belongs to the field of target detection. To improve the accuracy of the Tiny YOLOV3 target detection algorithm on the pedestrian detection task, the invention studies and improves the algorithm. The method first deepens the feature extraction network of Tiny YOLOV3 to strengthen its feature extraction capability; a channel-domain attention mechanism is then added to each of the two detection scales of the prediction network, assigning different weights to different channels of the feature map and guiding the network to focus on the visible regions of pedestrians; finally, the activation function and the loss function are improved, and the initial candidate boxes are reselected with the K-means clustering algorithm. The invention improves pedestrian detection accuracy while keeping a high detection speed, meeting real-time operation requirements.

Description

Real-time pedestrian detection method integrating attention mechanism
Technical Field
The invention relates to a real-time pedestrian detection method integrating an attention mechanism, and belongs to the technical field of target detection.
Background
With the development of science and technology, pedestrian detection is applied ever more widely in daily life and industrial production. Because images containing pedestrians have complex backgrounds and are affected by pose, clothing and occlusion, pedestrian detection is considerably more difficult; moreover, a practical pedestrian detection system requires both high accuracy and strong real-time performance, so research on pedestrian detection has very important practical significance.
Conventional pedestrian detection algorithms typically rely on hand-crafted feature extraction and classification. For example, the journal paper "A pedestrian detection model with local features" trains Adaboost classifiers with Haar features for different body parts and detects pedestrians with a support vector machine. The journal paper "Pedestrian detection by improving features and GPU" uses SILTP texture features and histograms of oriented gradients to extract features from different body parts and accelerates detection on a GPU (Graphics Processing Unit). With the growth of computing power, target detection algorithms based on convolutional neural networks have been proposed in succession. Commonly used methods today include the two-stage R-CNN family and the one-stage SSD and YOLO families. Two-stage detection algorithms generate candidate regions with selective search or a region proposal network and then predict the class and position of the target, which improves detection accuracy; however, because candidate-region generation and the detection network run separately, real-time target detection is difficult to achieve. One-stage detection algorithms regress the class and position of the target directly and detect quickly. Many researchers are currently working on pedestrian detection. For example, the paper "Learning efficient single-stage pedestrian detectors by asymptotic localization fitting" proposes a progressive localization fitting module that localizes pedestrians step by step across multiple scales and improves detection accuracy. The paper "Dense connection and spatial pyramid pooling based YOLO for object detection" improves the feature extraction network of YOLOV2, proposing a YOLO target detection algorithm based on dense connections and a spatial pyramid pooling structure that balances detection accuracy and speed. The paper "Pedestrian object detection with fusion of visual attention mechanism and semantic fusion" uses a visual attention mechanism and Laplacian pyramid fusion to obtain pedestrian saliency maps, achieving 92.78% detection accuracy on the INRIA dataset. These methods effectively improve pedestrian detection, but are not well suited to practical scenes: scenes with strict real-time requirements demand both high detection accuracy and high detection speed.
The YOLOV3 algorithm effectively improves detection accuracy through the structural design of the Feature Pyramid Network (FPN) and residual networks. However, its network structure is complex and the model is large, making real-time operation on embedded devices difficult. Tiny YOLOV3 is a simplified version of YOLOV3: its network structure is simple, its model is small and detection is fast, but detection accuracy is low. Moreover, Tiny YOLOV3 uses the FPN design to fuse the feature maps of its two detection scales, but this fusion merely concatenates the features of different channels and cannot reflect the relative importance of the feature-map channels. Addressing these problems, the invention optimizes and improves the Tiny YOLOV3 algorithm. First, the backbone network is deepened with 3×3 convolutions to strengthen the feature extraction capability of the network; then 1×1 convolutions reduce the dimensionality of the feature maps, cutting the number of model parameters and enabling cross-channel information interaction; next, a lightweight channel-domain attention mechanism is introduced into the two prediction branches, fusing information of different scales, assigning different weights to different channels of the feature map and guiding the network to focus on pedestrian regions; finally, the bounding-box regression loss function and the activation function are optimized, and the initial candidate boxes are reselected with the K-means clustering algorithm. Experimental results show that the improved Tiny YOLOV3 achieves higher pedestrian detection accuracy with fast detection, fewer model parameters and a small model size, making it suitable for real-time and embedded applications.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a real-time pedestrian detection method integrating an attention mechanism that optimizes and improves the Tiny YOLOV3 algorithm: first, the backbone network is deepened with 3×3 convolutions to strengthen the feature extraction capability of the network; then 1×1 convolutions reduce the dimensionality of the feature maps, cutting the number of model parameters and enabling cross-channel information interaction; next, a lightweight channel-domain attention mechanism is introduced into the two prediction branches, fusing information of different scales, assigning different weights to different channels of the feature map and guiding the network to focus on pedestrian regions; finally, the bounding-box regression loss function and the activation function are optimized, and the initial candidate boxes are reselected with the K-means clustering algorithm. Experimental results show that the improved Tiny YOLOV3 achieves higher pedestrian detection accuracy with fast detection, fewer model parameters and a small model size, making it suitable for real-time and embedded applications.
The invention discloses a real-time pedestrian detection method integrating an attention mechanism, which comprises the following steps:
S1: selecting the Tiny YOLOV3 algorithm, comprising the following steps:
S11: first, dividing the image into S×S grids, each grid predicting B bounding boxes, a confidence score and C class probabilities, wherein the confidence formula is:
C = P(object) × IOU_pred^truth (1)
wherein P(object) is the probability that an object exists in the grid and IOU_pred^truth is the intersection-over-union of the prediction box and the ground-truth box;
S12: the feature extraction network of Tiny YOLOV3 consists of 7 convolutional layers and 6 max-pooling layers; the multi-scale detection of YOLOV3 is simplified, and prediction outputs are produced on feature maps at the two detection scales 26×26 and 13×13;
S2: deepening the feature extraction network, comprising the following steps:
S21: first, expanding the number of channels to 2 times that of the previous layer with a 3×3 convolution to extract high-dimensional features;
S22: then compressing the number of channels to half of the original with a 1×1 convolution, reducing the channel dimension, cutting the amount of computation and enabling cross-channel information interaction;
S23: finally expanding the channels with a 3×3 convolution to restore the original channel dimension;
S3: the prediction network fused with channel attention: an attention mechanism is introduced into the prediction network of Tiny YOLOV3 to fuse information of different scales, assign different weights to the feature channels, guide the network to focus on pedestrian features and reduce the influence of interfering information, thereby improving detection accuracy, comprising the following steps:
S31: introducing ECA-Net, a lightweight channel-domain attention mechanism without dimensionality reduction; the input feature map X ∈ R^(H×W×C) has C feature channels;
S32: compressing the global spatial information through global average pooling, i.e., compressing over the spatial dimensions H×W to obtain 1×1 weight information, wherein the global average pooling formula is:
y = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} X_{ij} (2)
wherein y is the weight obtained after compression and H×W is the spatial dimension;
S33: to let the network automatically learn the attention weights of different channels, completing cross-channel information interaction with a one-dimensional convolution, the kernel size k of which is adaptively determined as a function of the channel dimension C:
k = ψ(C) = |log2(C)/γ + b/γ|_odd (3)
wherein |t|_odd denotes the odd number nearest to t, and γ and b are preset constants;
S34: applying the one-dimensional convolution with the obtained kernel size and using a Sigmoid to obtain the weight of each channel:
ω_c = σ(C1D_k(y)) (4)
wherein σ is the Sigmoid activation function and ω_c is the generated channel attention weight with dimensions 1×1×C;
S35: then weighting the input feature map by the attention weights to express the importance of each feature-map channel:
X_c = ω_c ⊗ X (5)
wherein ⊗ denotes element-by-element multiplication and X_c denotes the output of the attention mechanism;
S4: improving the loss function and the activation function, comprising the following steps:
S41: during training, the loss function of Tiny YOLOV3 divides into three parts, namely bounding-box regression loss, confidence loss and classification loss, and the total loss can be expressed by equation (6):
Loss = Σ_{i=1}^{2} (Loss_box^i + Loss_conf^i + Loss_cls^i) (6)
wherein i indexes the detection scale;
S42: with the generalized intersection-over-union GIOU as the regression loss, IOU and GIOU being defined as follows:
IOU = |B ∩ B_gt| / |B ∪ B_gt| (7)
GIOU = IOU - |C \ (B ∪ B_gt)| / |C| (8)
wherein B denotes the prediction box, B_gt the ground-truth box, and C the minimum enclosing region containing both the ground-truth box and the prediction box;
S43: the activation function is an important unit of a convolutional neural network, introducing nonlinear factors so that the model is no longer a simple linear mapping and the network learns better; the improved feature extraction network adopts the Mish activation function.
Preferably, in step S1, the YOLO series algorithms are one-stage target detection algorithms based on convolutional neural networks, and Tiny YOLOV3 is a simplified version of YOLOV3.
Preferably, in step S2, the feature extraction network of Tiny YOLOV3 is shallow, deep features are difficult to extract, and accuracy in pedestrian target detection is low; to avoid excessive computation, drawing on the idea of densely connected networks, a convolutional layer with kernel size 1×1 is introduced before each added 3×3 convolutional layer, reducing the channel dimension so as to reduce the computation of the network.
Preferably, in step S3, in actual pedestrian detection scenes, interference from background information and the presence of occlusion affect the network's extraction of pedestrian features and hence pedestrian detection accuracy; the prediction network of Tiny YOLOV3 fuses feature maps of two scales, but this fusion merely concatenates the features along the channel dimension and cannot reflect the importance of pedestrian features on certain channels.
Preferably, in step S32, the convolutional neural network can only learn the local receptive field, and cannot utilize the context information outside the region.
Preferably, in step S12, the feature maps output at the two detection scales have sizes 13×13 and 26×26 respectively; that is, the input image is divided into 13×13 and 26×26 grids that detect pedestrians at long range and short range respectively, with each grid cell's outputs arranged along the channels of the feature map.
Preferably, in step S12, each grid cell is preset with 3 preselected boxes, which are continuously adjusted during training so that the optimal preselected box is selected as the output result; the different channels represent the output parameters of each grid cell; taking the 13×13 feature map as an example, the parameters comprise the center coordinates (bx, by) of the prediction box, its width and height (bw, bh), its confidence score p0 and the prediction score s for the pedestrian; each grid cell contains 3 prediction boxes and each box carries these 6 parameters, so the channel dimension of the output feature maps is 18.
Preferably, in step S12, an ECA attention module is added to the prediction branch that outputs the 13×13 feature map; the feature map passing through the attention module is upsampled and concatenated with the 26×26 feature map to output a 384-channel feature map, whose weights are then redistributed by another ECA attention module, so that the two final output layers pay more attention to pedestrian information, effectively reducing the influence of interfering information and occlusion.
The beneficial effects of the invention are as follows: the real-time pedestrian detection method integrating an attention mechanism studies and improves the Tiny YOLOV3 target detection algorithm to raise its accuracy on the pedestrian detection task. First, the feature extraction network of Tiny YOLOV3 is deepened to strengthen feature extraction; then a channel-domain attention mechanism is added to each of the two detection scales of the prediction network, assigning different weights to different channels of the feature map and guiding the network to focus on the visible regions of pedestrians; finally, the activation function and the loss function are improved and the initial candidate boxes are reselected with the K-means clustering algorithm. Experimental results show that the accuracy of the improved Tiny YOLOV3 algorithm reaches 77% on the VOC2007 pedestrian subset, 8.5% higher than Tiny YOLOV3, and 92.7% on the INRIA dataset, an improvement of 2.5%, with running speeds of 92.6 and 31.2 frames per second respectively. The invention improves pedestrian detection accuracy while keeping a high detection speed, meeting real-time operation requirements.
Drawings
FIG. 1 shows the structure of the Tiny YOLOV3 model.
FIG. 2 shows the structure of the improved Tiny YOLOV3 model.
FIG. 3 shows the structure of the ECA module.
FIG. 4 shows the structure of the prediction layer.
FIG. 5 plots the LeakyRelu and Mish activation functions.
FIGS. 6(a)-6(b) plot the AP curves on the different datasets.
FIGS. 7(a)-7(c) show detection results of Tiny YOLOV3.
FIGS. 8(a)-8(c) show detection results of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
the optimization of the Tiny YOLOV3 algorithm is improved. Firstly, deepening a backbone network by adopting a 3 multiplied by 3 convolution, and enhancing the feature extraction capability of the network; then, carrying out dimension reduction on the feature map by adopting 1 × 1 convolution, reducing the model parameter quantity, and realizing cross-channel information interaction; then, introducing a lightweight channel domain attention mechanism into the two prediction networks, fusing information of different scales by using the attention mechanism, giving different weights to different channels of the characteristic diagram, and guiding the networks to pay attention to pedestrian areas; and finally, optimizing a regression loss function and an activation function of the bounding box, and reselecting the initial candidate box by adopting a K-means clustering algorithm. Experimental results show that the improved Tiny Yoloov 3 has higher pedestrian detection precision, higher detection speed, less model parameters, small volume and suitability for real-time and embedded application.
S1: Selecting the Tiny YOLOV3 algorithm:
the YOLO series of algorithms are one-stage target detection algorithms based on convolutional neural networks. The algorithm divides an image into S multiplied by S grids, each grid predicts B bounding boxes, confidence coefficients and C class probabilities, and the confidence coefficient formula is as follows:
Figure BDA0002898457710000061
wherein P (object) is the existence probability of the object in the grid,
Figure BDA0002898457710000062
the intersection ratio of the prediction frame and the real frame is obtained.
Tiny YOLOV3 is a simplified version of YOLOV3. Compared with the complex network structure of YOLOV3, Tiny YOLOV3 reduces the feature extraction network to 7 convolutional layers and 6 max-pooling (Maxpool) layers, shrinking the model, and simplifies the multi-scale detection of YOLOV3, predicting on feature maps at the two detection scales 26×26 and 13×13; the network structure is shown in fig. 1.
S2: Deepening the feature extraction network:
The feature extraction network of Tiny YOLOV3 is shallow, deep features are difficult to extract, and accuracy in pedestrian target detection is low. The invention therefore deepens the feature extraction network, adding 4 convolutional layers with kernel size 3×3 to the original network to strengthen feature extraction and improve detection accuracy. Although adding convolutional layers can improve pedestrian detection accuracy, the number of model parameters grows sharply as layers stack up, greatly increasing computation and memory usage.
To avoid excessive computation, the invention draws on the idea of densely connected networks and introduces a convolutional layer with kernel size 1×1 before each added 3×3 convolutional layer, reducing the channel dimension and thereby the network's computation. Specifically, a 3×3 convolution first expands the number of channels to 2 times that of the previous layer to extract high-dimensional features; a 1×1 convolution then compresses the number of channels to half, reducing the channel dimension, cutting computation and enabling cross-channel information interaction; finally a 3×3 convolution expands the channels again to restore the original channel dimension. The improved model structure is shown in fig. 2, where the left dashed box is the improved feature extraction network.
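A minimal PyTorch sketch of one such expand-compress-restore unit follows; the Conv-BN-LeakyRelu ordering, bias settings and the helper name deepen_block are illustrative assumptions, since the patent fixes only the kernel sizes and channel ratios (and later replaces the activation with Mish, see S4):

```python
import torch.nn as nn

def deepen_block(c_in):
    """One added unit: 3x3 conv doubles the channels, 1x1 conv halves them
    for cross-channel interaction, 3x3 conv expands them again."""
    c_mid = 2 * c_in
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 3, padding=1, bias=False),  # expand to 2x channels
        nn.BatchNorm2d(c_mid),
        nn.LeakyReLU(0.1),
        nn.Conv2d(c_mid, c_in, 1, bias=False),             # 1x1 compress to half
        nn.BatchNorm2d(c_in),
        nn.LeakyReLU(0.1),
        nn.Conv2d(c_in, c_mid, 3, padding=1, bias=False),  # restore expanded dim
        nn.BatchNorm2d(c_mid),
        nn.LeakyReLU(0.1),
    )
```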
S3: Prediction network fused with channel attention:
In actual pedestrian detection scenes, interference from background information and the presence of occlusion affect the network's extraction of pedestrian features and hence pedestrian detection accuracy. The prediction network of Tiny YOLOV3 fuses feature maps of two scales, but this fusion merely concatenates (concat) the features along the channel dimension and cannot reflect the importance of pedestrian features on certain channels. The invention therefore introduces an attention mechanism into the prediction network of Tiny YOLOV3, using it to fuse information of different scales, assign different weights to the feature channels, guide the network to focus on pedestrian features and reduce the influence of interfering information, thereby improving detection accuracy; the right dashed box in fig. 2 is the improved prediction network. To let the network automatically learn the weights of the feature channels, the invention introduces the Efficient Channel Attention network (ECA-Net), a lightweight channel-domain attention mechanism without dimensionality reduction, as shown in FIG. 3.
In FIG. 3, the input feature map X ∈ R^(H×W×C) has C feature channels. In general, a convolutional neural network can only learn local receptive fields and cannot utilize context information outside the region. Therefore, the global spatial information is compressed by global average pooling, i.e., compressed over the spatial dimensions H×W to obtain 1×1 weight information; the global average pooling formula is:
y = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} X_{ij} (2)
where y is the weight obtained after compression and H×W is the spatial dimension.
To let the network automatically learn the attention weights of different channels, cross-channel information interaction is completed with a one-dimensional convolution. The kernel size k of the one-dimensional convolution is adaptively determined as a function of the channel dimension C:
k = ψ(C) = |log2(C)/γ + b/γ|_odd (3)
where |t|_odd denotes the odd number nearest to t, and γ and b are preset constants.
the resulting convolution kernel is used for one-dimensional convolution and Sigmoid is used to obtain the weight of each channel. The formula is as follows:
ωc=σ(C1Dk(y)) (4)
where σ is the Sigmoid activation function, ωcIs the generated channel attention weight with dimensions of 1 × 1 × C. Then, the attention weight and the input feature map are weighted to realize the importance expression of the feature map channel, and the weighting formula is as follows:
Figure BDA0002898457710000073
wherein the content of the first and second substances,
Figure BDA0002898457710000074
denotes element-by-element multiplication, XcIndicating the output result by the attention mechanism.
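For reference, a minimal PyTorch sketch of the ECA module of equations (2)-(5) might look as follows; the constants gamma=2 and b=1 follow the published ECA-Net paper and are assumptions here, as the patent leaves them unspecified:

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention sketch following Eqs. (2)-(5)."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1                      # Eq. (3): nearest odd size
        self.pool = nn.AdaptiveAvgPool2d(1)            # Eq. (2): global avg pooling
        self.conv = nn.Conv1d(1, 1, kernel_size=k,
                              padding=k // 2, bias=False)  # 1-D conv across channels
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                              # x: (N, C, H, W)
        y = self.pool(x)                               # (N, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(-1, -2)) # (N, 1, C)
        y = y.transpose(-1, -2).unsqueeze(-1)          # back to (N, C, 1, 1)
        w = self.sigmoid(y)                            # Eq. (4): channel weights
        return x * w                                   # Eq. (5): channel-wise weighting
```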
As shown in fig. 4, the feature maps output at the two detection scales have sizes 13×13 and 26×26 respectively; that is, the input image is divided into 13×13 and 26×26 grids that detect pedestrians at long range and short range respectively, with each grid cell's outputs arranged along the channels of the feature map. Each grid cell is preset with 3 preselected boxes, which are continuously adjusted during training so that the best one is selected as the output result. The different channels represent the output parameters of each grid cell; taking the 13×13 feature map as an example, the parameters of each box comprise the center coordinates (b_x, b_y) of the prediction box, its width and height (b_w, b_h), its confidence score p_0 and the prediction score s for the pedestrian. Each grid cell contains 3 prediction boxes and each box carries these 6 parameters, so the channel dimension of the output feature maps is 18. The invention combines the ECA attention module with the prediction network of Tiny YOLOV3, adding it to each of the two detection scales. An ECA attention module is added to the prediction branch that outputs the 13×13 feature map; the feature map passing through the attention module is upsampled and concatenated with the 26×26 feature map to output a 384-channel feature map, whose weights are then redistributed by another ECA attention module, so that the two final output layers pay more attention to pedestrian information, effectively reducing the influence of interfering information and occlusion.
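Under the same caveat, the fused prediction path could be sketched as below, reusing the ECA class from the sketch above; the branch channel counts (256 each) and the 1×1 reduction to 128 are assumptions chosen only so that the concatenation yields the 384 channels stated in the text:

```python
import torch
import torch.nn as nn

class AttentionFusedHead(nn.Module):
    """Sketch of the attention-fused prediction path: ECA on the 13x13 branch,
    1x1 reduction and upsampling, concatenation with the 26x26 branch into
    384 channels, then a second ECA before the output layers."""
    def __init__(self):
        super().__init__()
        self.eca13 = ECA(256)                 # attention on the coarse scale
        self.reduce = nn.Conv2d(256, 128, 1)  # assumed 1x1 reduce before upsample
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.eca26 = ECA(384)                 # attention after fusion (128 + 256)
        self.head13 = nn.Conv2d(256, 18, 1)   # 3 boxes x 6 params = 18 channels
        self.head26 = nn.Conv2d(384, 18, 1)

    def forward(self, f13, f26):              # f13: (N,256,13,13), f26: (N,256,26,26)
        a13 = self.eca13(f13)
        fused = torch.cat([self.up(self.reduce(a13)), f26], dim=1)  # (N,384,26,26)
        return self.head13(a13), self.head26(self.eca26(fused))
```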
S4: Improving the loss function and the activation function:
During training, the loss function of Tiny YOLOV3 divides into three parts, namely bounding-box regression loss, confidence loss and classification loss; the total loss can be expressed by equation (6):
Loss = Σ_{i=1}^{2} (Loss_box^i + Loss_conf^i + Loss_cls^i) (6)
where i indexes the detection scale.
Pedestrian localization usually depends on accurate bounding-box regression; to improve localization and detection accuracy, the bounding-box regression loss is optimized. The invention adopts the Generalized Intersection Over Union (GIOU) as the regression loss, for two reasons: first, the Intersection Over Union (IOU) cannot be evaluated when the ground-truth box and the prediction box do not intersect; second, the IOU cannot accurately reflect how the two boxes overlap. IOU and GIOU are defined as follows:
IOU = |B ∩ B_gt| / |B ∪ B_gt| (7)
GIOU = IOU - |C \ (B ∪ B_gt)| / |C| (8)
where B denotes the prediction box, B_gt the ground-truth box, and C the minimum enclosing region containing both the ground-truth box and the prediction box.
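A plain-Python sketch of equations (7)-(8) follows; corner-format boxes, the helper name giou and the epsilon guard are illustrative, not taken from the patent:

```python
def giou(box_a, box_b, eps=1e-9):
    """GIOU per Eqs. (7)-(8); boxes are (x1, y1, x2, y2) corner tuples."""
    # intersection area
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / (union + eps)                     # Eq. (7)
    # smallest enclosing region C
    cw = max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])
    ch = max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])
    c_area = cw * ch
    return iou - (c_area - union) / (c_area + eps)  # Eq. (8)
```

In training, the regression loss would then be taken as 1 - GIOU, so that perfectly overlapping boxes incur zero loss.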
The activation function is an important unit of a convolutional neural network: it introduces nonlinear factors so that the model is no longer a simple linear mapping and the network can learn better; activation functions have evolved quickly as network models have matured. The feature extraction network of Tiny YOLOV3 adopts the LeakyRelu activation function; the invention replaces it with the Mish activation function, whose curve is shown in fig. 5. The Mish activation function is smoother, helping the network learn pedestrian information better, and it lets a small negative gradient flow through so that information is not cut off, yielding better accuracy and generalization.
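For reference, Mish has the closed form Mish(x) = x · tanh(softplus(x)); a one-line PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def mish(x):
    """Mish(x) = x * tanh(softplus(x)): smooth everywhere and lets a small
    negative gradient through, unlike LeakyRelu's fixed negative slope."""
    return x * torch.tanh(F.softplus(x))
```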
Example 2:
the experimental environment configuration of the present invention is shown in table 1. The experiment is written in python 3.6 language, and the deep learning frame is Pytrch 1.4. Training batch set to 300, mini-batch set to 16, initial learning rate of 0.01, weight attenuation coefficient of 0.0005, momentum coefficient of 0.9. The invention adopts a multi-scale training mode, and images of each batch are randomly selected in (320, 352, 384, 416, 448, 480, 512, 544, 576, 608 and 640) so as to improve the generalization capability of the model.
Table 1 experimental environment configuration
The experimental datasets are VOC2007 and INRIA. The VOC2007 dataset contains 20 target classes, 9963 images in total. The invention extracts all 4015 pedestrian images from the VOC2007 dataset; their backgrounds are complex, pedestrian poses vary widely and occlusion of different degrees exists, which strengthens the generalization ability of the trained model. This dataset is split 8:2 into training and test sets. Most pedestrians in the INRIA dataset are standing and close to real road scenes; its training and test sets are already divided. The numbers of pedestrian images in the datasets are shown in Table 2.
TABLE 2 Number of pedestrian images in each dataset
Results and analysis of the experiments
To evaluate the effectiveness of the improved algorithm, YOLOV3, Tiny YOLOV3 and the present invention were each trained and tested on the VOC2007 and INRIA datasets. Before training, to make the preselected boxes fit pedestrians better, the K-means clustering algorithm reselects the initial preselected boxes, yielding 6 box sizes: (38,97), (81,202), (126,386) correspond to the 13×13 prediction layer, and (203,271), (251,473), (448,521) correspond to the 26×26 prediction layer.
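A NumPy sketch of such anchor clustering, assuming the common YOLO variant that uses 1 - IoU of co-centered boxes as the distance (the patent does not state its exact distance measure):

```python
import numpy as np

def kmeans_anchors(wh, k=6, iters=100, seed=0):
    """Cluster ground-truth box (width, height) pairs into k anchor sizes.
    wh: array of shape (num_boxes, 2)."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # IoU of co-centered boxes as similarity; assign to most similar center
        inter = (np.minimum(wh[:, None, 0], centers[None, :, 0]) *
                 np.minimum(wh[:, None, 1], centers[None, :, 1]))
        union = (wh[:, None, 0] * wh[:, None, 1] +
                 centers[None, :, 0] * centers[None, :, 1] - inter)
        assign = np.argmax(inter / union, axis=1)
        centers = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i)
                            else centers[i] for i in range(k)])
    return centers[np.argsort(centers.prod(axis=1))]  # sorted by box area
```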
The evaluation metrics comprise Precision and Recall; the comprehensive metric Average Precision (AP) measures the accuracy of the detection algorithm, and Frames Per Second (FPS) measures the detection speed. To obtain the best trained model, the test set is evaluated after each training batch completes and the model with the highest AP is saved. Figs. 6(a) and 6(b) show the accuracy curves of the invention trained on the VOC2007 and INRIA datasets, respectively.
Table 3 shows the model sizes and parameter counts of the different algorithms. The model size of the invention is 39.8 MB, only 6.6 MB larger than Tiny YOLOV3 and far smaller than YOLOV3, so the invention retains a clear advantage in model size and parameter count.
TABLE 3 model sizes and parameter quantities for the algorithms
Table 4 shows the training and test results of each algorithm on the two datasets; the precision and recall of the invention are improved over Tiny YOLOV3. Pedestrian detection accuracy on the VOC dataset is 77%, 8.5% higher than Tiny YOLOV3; although this does not reach the accuracy of YOLOV3, the detection speed reaches 92.6 frames per second, 77.1% faster than YOLOV3. Accuracy on the INRIA dataset is 92.7%, 2.5% higher than Tiny YOLOV3 and only 0.2% below the YOLOV3 algorithm, comparable to the accuracy reported in the literature but superior in detection speed; the detection speed of the invention reaches 31.2 frames per second, meeting real-time detection requirements.
TABLE 4 comparison of the results of the experiments for each algorithm
FIGS. 7 and 8 compare the detection results of Tiny YOLOV3 and of the present invention. Two pedestrian targets are missed in FIG. 7(a), while no pedestrian is missed in FIG. 8(a); FIGS. 7(b) and 8(b) show pedestrian detection in a crowded scene, where the missed detections of Tiny YOLOV3 are more serious and the invention improves markedly; FIG. 7(c) misses small pedestrian targets on the left, with no missed detection in FIG. 8(c). The invention achieves a better pedestrian detection effect, performing well in crowded scenes and on small targets, which shows good generalization ability and accurate pedestrian detection.
On the basis of Tiny YOLOV3, the invention proposes a pedestrian detection algorithm fused with an attention mechanism: deepening the network improves the extraction of pedestrian features, while 1×1 convolutions reduce the parameter count and model size, preserving detection speed. Meanwhile, a lightweight channel attention mechanism without dimensionality reduction is introduced into the prediction network to redistribute the weights of different channels, making the model attend more to pedestrian information. In addition, optimizing the bounding-box regression loss function and the activation function further improves detection accuracy. Detection accuracies of 77% and 92.7% are obtained on the VOC2007 pedestrian subset and the INRIA dataset, with precision and recall improved over Tiny YOLOV3, and detection speeds reach 92.6 and 31.2 frames per second respectively, showing that the model is robust across datasets and meets real-time detection requirements. The invention keeps a speed advantage while maintaining high detection accuracy; however, when pedestrian poses vary greatly or occlusion is severe, its accuracy still falls short of complex large-scale networks, and future work will further improve detection accuracy while preserving real-time detection.
The invention can be widely applied in target detection scenarios.
It is to be noted that, in the present invention, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A real-time pedestrian detection method fused with an attention mechanism is characterized by comprising the following steps:
S1: selecting the Tiny YOLOV3 algorithm, comprising the following steps:
S11: first, dividing the image into S×S grids, each grid predicting B bounding boxes, a confidence score and C class probabilities, wherein the confidence formula is:
C = P(object) × IOU_pred^truth (1)
wherein P(object) is the probability that an object exists in the grid and IOU_pred^truth is the intersection-over-union of the prediction box and the ground-truth box;
S12: the feature extraction network of Tiny YOLOV3 consists of 7 convolutional layers and 6 max-pooling layers; the multi-scale detection of YOLOV3 is simplified, and prediction outputs are produced on feature maps at the two detection scales 26×26 and 13×13;
S2: deepening the feature extraction network, comprising the following steps:
S21: first, expanding the number of channels to 2 times that of the previous layer with a 3×3 convolution to extract high-dimensional features;
S22: then compressing the number of channels to half of the original with a 1×1 convolution, reducing the channel dimension, cutting the amount of computation and enabling cross-channel information interaction;
S23: finally expanding the channels with a 3×3 convolution to restore the original channel dimension;
S3: the prediction network fused with channel attention: an attention mechanism is introduced into the prediction network of Tiny YOLOV3 to fuse information of different scales, assign different weights to the feature channels, guide the network to focus on pedestrian features and reduce the influence of interfering information, thereby improving detection accuracy, comprising the following steps:
S31: introducing ECA-Net, a lightweight channel-domain attention mechanism without dimensionality reduction; the input feature map X ∈ R^(H×W×C) has C feature channels;
S32: compressing the global spatial information through global average pooling, i.e., compressing over the spatial dimensions H×W to obtain 1×1 weight information, wherein the global average pooling formula is:
y = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} X_{ij} (2)
wherein y is the weight obtained after compression and H×W is the spatial dimension;
S33: to let the network automatically learn the attention weights of different channels, completing cross-channel information interaction with a one-dimensional convolution, the kernel size k of which is adaptively determined as a function of the channel dimension C:
k = ψ(C) = |log2(C)/γ + b/γ|_odd (3)
wherein |t|_odd denotes the odd number nearest to t, and γ and b are preset constants;
S34: applying the one-dimensional convolution with the obtained kernel size and using a Sigmoid to obtain the weight of each channel:
ω_c = σ(C1D_k(y)) (4)
wherein σ is the Sigmoid activation function and ω_c is the generated channel attention weight with dimensions 1×1×C;
S35: then weighting the input feature map by the attention weights to express the importance of each feature-map channel:
X_c = ω_c ⊗ X (5)
wherein ⊗ denotes element-by-element multiplication and X_c denotes the output of the attention mechanism;
S4: improving the loss function and the activation function, comprising the following steps:
S41: during training, the loss function of Tiny YOLOV3 divides into three parts, namely bounding-box regression loss, confidence loss and classification loss, and the total loss can be expressed by equation (6):
Loss = Σ_{i=1}^{2} (Loss_box^i + Loss_conf^i + Loss_cls^i) (6)
wherein i indexes the detection scale;
S42: with the generalized intersection-over-union GIOU as the regression loss, IOU and GIOU being defined as follows:
IOU = |B ∩ B_gt| / |B ∪ B_gt| (7)
GIOU = IOU - |C \ (B ∪ B_gt)| / |C| (8)
wherein B denotes the prediction box, B_gt the ground-truth box, and C the minimum enclosing region containing both the ground-truth box and the prediction box;
S43: the activation function is an important unit of a convolutional neural network, introducing nonlinear factors so that the model is no longer a simple linear mapping and the network learns better; the improved feature extraction network adopts the Mish activation function.
2. The real-time pedestrian detection method fused with an attention mechanism according to claim 1, wherein in step S1, the YOLO series algorithms are one-stage target detection algorithms based on convolutional neural networks, and Tiny YOLOV3 is a simplified version of YOLOV3.
3. The real-time pedestrian detection method fused with an attention mechanism according to claim 1, wherein in step S2, the feature extraction network of Tiny YOLOV3 is shallow, deep features are difficult to extract, and accuracy in pedestrian target detection is low; to avoid excessive computation, drawing on the idea of densely connected networks, a convolutional layer with kernel size 1×1 is introduced before each added 3×3 convolutional layer, reducing the channel dimension so as to reduce the computation of the network.
4. The real-time pedestrian detection method fused with an attention mechanism according to claim 1, wherein in step S3, in actual pedestrian detection scenes, interference from background information and the presence of occlusion affect the network's extraction of pedestrian features and hence pedestrian detection accuracy; the prediction network of Tiny YOLOV3 fuses feature maps of two scales, but this fusion merely concatenates the features along the channel dimension and cannot reflect the importance of pedestrian features on certain channels.
5. The real-time pedestrian detection method fused with an attention mechanism according to claim 1, wherein in step S32, the convolutional neural network can only learn local receptive fields and cannot utilize context information outside the region.
6. The real-time pedestrian detection method fused with an attention mechanism according to claim 1, wherein in step S12, the feature maps output at the two detection scales have sizes 13×13 and 26×26 respectively, i.e., the input image is divided into 13×13 and 26×26 grids that detect pedestrians at long range and short range respectively, with each grid cell's outputs arranged along the channels of the feature map.
7. The real-time pedestrian detection method fused with an attention mechanism according to claim 6, wherein in step S12, each grid cell is preset with 3 preselected boxes, which are continuously adjusted during training so that the optimal preselected box is selected as the output result; the different channels represent the output parameters of each grid cell; taking the 13×13 feature map as an example, the parameters comprise the center coordinates (bx, by) of the prediction box, its width and height (bw, bh), its confidence score p0 and the prediction score s for the pedestrian; each grid cell contains 3 prediction boxes and each box carries these 6 parameters, so the channel dimension of the output feature maps is 18.
8. The real-time pedestrian detection method fused with an attention mechanism according to claim 7, wherein in step S12, an ECA attention module is added to the prediction branch that outputs the 13×13 feature map; the feature map passing through the attention module is upsampled and concatenated with the 26×26 feature map to output a 384-channel feature map, whose weights are then redistributed by another ECA attention module, so that the two final output layers pay more attention to pedestrian information, effectively reducing the influence of interfering information and occlusion.
CN202110049426.6A 2021-01-14 2021-01-14 Real-time pedestrian detection method integrating attention mechanism Active CN112733749B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110049426.6A CN112733749B (en) 2021-01-14 2021-01-14 Real-time pedestrian detection method integrating attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110049426.6A CN112733749B (en) 2021-01-14 2021-01-14 Real-time pedestrian detection method integrating attention mechanism

Publications (2)

Publication Number Publication Date
CN112733749A true CN112733749A (en) 2021-04-30
CN112733749B CN112733749B (en) 2022-04-12

Family

ID=75593101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110049426.6A Active CN112733749B (en) 2021-01-14 2021-01-14 Real-time pedestrian detection method integrating attention mechanism

Country Status (1)

Country Link
CN (1) CN112733749B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113282088A (en) * 2021-05-21 2021-08-20 潍柴动力股份有限公司 Unmanned driving method, device and equipment of engineering vehicle, storage medium and engineering vehicle
CN113327243A (en) * 2021-06-24 2021-08-31 浙江理工大学 PAD light guide plate defect visualization detection method based on AYOLOv3-Tiny new framework
CN113496260A (en) * 2021-07-06 2021-10-12 浙江大学 Grain depot worker non-standard operation detection method based on improved YOLOv3 algorithm
CN113516080A (en) * 2021-07-16 2021-10-19 上海高德威智能交通系统有限公司 Behavior detection method and device
CN113538347A (en) * 2021-06-29 2021-10-22 中国电子科技集团公司电子科学研究院 Image detection method and system based on efficient bidirectional path aggregation attention network
CN113537013A (en) * 2021-07-06 2021-10-22 哈尔滨理工大学 Multi-scale self-attention feature fusion pedestrian detection method
CN113705478A (en) * 2021-08-31 2021-11-26 中国林业科学研究院资源信息研究所 Improved YOLOv 5-based mangrove forest single tree target detection method
CN113989624A (en) * 2021-12-08 2022-01-28 北京环境特性研究所 Infrared low-slow small target detection method and device, computing equipment and storage medium
CN114067186A (en) * 2021-09-26 2022-02-18 北京建筑大学 Pedestrian detection method and device, electronic equipment and storage medium
CN114092820A (en) * 2022-01-20 2022-02-25 城云科技(中国)有限公司 Target detection method and moving target tracking method applying same
CN114373118A (en) * 2021-12-30 2022-04-19 华南理工大学 Underwater target detection method based on improved YOLOV4
CN114724012A (en) * 2022-06-10 2022-07-08 天津大学 Tropical unstable wave early warning method and device based on spatio-temporal cross-scale attention fusion
CN115063691A (en) * 2022-07-04 2022-09-16 西安邮电大学 Small target detection method based on feature enhancement under complex scene
CN115273154A (en) * 2022-09-26 2022-11-01 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Thermal infrared pedestrian detection method and system based on edge reconstruction and storage medium
CN115424230A (en) * 2022-09-23 2022-12-02 哈尔滨市科佳通用机电股份有限公司 Fault detection method for vehicle door pulley out-of-track, storage medium and equipment
CN115439765A (en) * 2022-09-17 2022-12-06 艾迪恩(山东)科技有限公司 Marine plastic garbage rotation detection method based on machine learning unmanned aerial vehicle visual angle
CN115908952A (en) * 2023-01-07 2023-04-04 石家庄铁道大学 High-speed rail tunnel fixture detection method based on improved YOLOv5 algorithm
CN117649633A (en) * 2024-01-30 2024-03-05 武汉纺织大学 Pavement pothole detection method for highway inspection
CN115063691B (en) * 2022-07-04 2024-04-12 西安邮电大学 Feature enhancement-based small target detection method in complex scene

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619309A (en) * 2019-09-19 2019-12-27 天津天地基业科技有限公司 Embedded platform face detection method based on octave convolution sum YOLOv3
CN111079584A (en) * 2019-12-03 2020-04-28 东华大学 Rapid vehicle detection method based on improved YOLOv3
CN111681240A (en) * 2020-07-07 2020-09-18 福州大学 Bridge surface crack detection method based on YOLO v3 and attention mechanism
CN111767882A (en) * 2020-07-06 2020-10-13 江南大学 Multi-mode pedestrian detection method based on improved YOLO model
CN112070713A (en) * 2020-07-03 2020-12-11 中山大学 Multi-scale target detection method introducing attention mechanism
CN112101434A (en) * 2020-09-04 2020-12-18 河南大学 Infrared image weak and small target detection method based on improved YOLO v3

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619309A (en) * 2019-09-19 2019-12-27 天津天地基业科技有限公司 Embedded platform face detection method based on octave convolution sum YOLOv3
CN111079584A (en) * 2019-12-03 2020-04-28 东华大学 Rapid vehicle detection method based on improved YOLOv3
CN112070713A (en) * 2020-07-03 2020-12-11 中山大学 Multi-scale target detection method introducing attention mechanism
CN111767882A (en) * 2020-07-06 2020-10-13 江南大学 Multi-mode pedestrian detection method based on improved YOLO model
CN111681240A (en) * 2020-07-07 2020-09-18 福州大学 Bridge surface crack detection method based on YOLO v3 and attention mechanism
CN112101434A (en) * 2020-09-04 2020-12-18 河南大学 Infrared image weak and small target detection method based on improved YOLO v3

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
QILONG WANG ET AL.: "ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) *
XIAOLAN WANG ET AL.: "Data-Driven Based Tiny-YOLOv3 Method for Front Vehicle Detection Inducing SPP-Net", Special Section on Intelligent Logistics Based on Big Data *
周志锋 et al.: "Improved target detection based on the YOLO V3 framework", Electronic Measurement Technology *
成玉荣 et al.: "People counting method based on improved Tiny-YOLOv3", Science and Technology Innovation Herald *
王艺皓: "Mask-wearing detection algorithm based on improved YOLOv3 in complex scenes", Computer Engineering *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113282088A (en) * 2021-05-21 2021-08-20 潍柴动力股份有限公司 Unmanned driving method, device and equipment of engineering vehicle, storage medium and engineering vehicle
CN113327243A (en) * 2021-06-24 2021-08-31 浙江理工大学 PAD light guide plate defect visualization detection method based on AYOLOv3-Tiny new framework
CN113327243B (en) * 2021-06-24 2024-01-23 浙江理工大学 PAD light guide plate defect visual detection method based on Ayolov3-Tiny new framework
CN113538347A (en) * 2021-06-29 2021-10-22 中国电子科技集团公司电子科学研究院 Image detection method and system based on efficient bidirectional path aggregation attention network
CN113538347B (en) * 2021-06-29 2023-10-27 中国电子科技集团公司电子科学研究院 Image detection method and system based on efficient bidirectional path aggregation attention network
CN113496260B (en) * 2021-07-06 2024-01-30 浙江大学 Grain depot personnel non-standard operation detection method based on improved YOLOv3 algorithm
CN113496260A (en) * 2021-07-06 2021-10-12 浙江大学 Grain depot worker non-standard operation detection method based on improved YOLOv3 algorithm
CN113537013A (en) * 2021-07-06 2021-10-22 哈尔滨理工大学 Multi-scale self-attention feature fusion pedestrian detection method
CN113516080A (en) * 2021-07-16 2021-10-19 上海高德威智能交通系统有限公司 Behavior detection method and device
CN113705478A (en) * 2021-08-31 2021-11-26 中国林业科学研究院资源信息研究所 Improved YOLOv 5-based mangrove forest single tree target detection method
CN113705478B (en) * 2021-08-31 2024-02-27 中国林业科学研究院资源信息研究所 Mangrove single wood target detection method based on improved YOLOv5
CN114067186B (en) * 2021-09-26 2024-04-16 北京建筑大学 Pedestrian detection method and device, electronic equipment and storage medium
CN114067186A (en) * 2021-09-26 2022-02-18 北京建筑大学 Pedestrian detection method and device, electronic equipment and storage medium
CN113989624A (en) * 2021-12-08 2022-01-28 北京环境特性研究所 Infrared low-slow small target detection method and device, computing equipment and storage medium
CN114373118A (en) * 2021-12-30 2022-04-19 华南理工大学 Underwater target detection method based on improved YOLOV4
CN114373118B (en) * 2021-12-30 2024-04-05 华南理工大学 Underwater target detection method based on improved YOLOV4
CN114092820A (en) * 2022-01-20 2022-02-25 城云科技(中国)有限公司 Target detection method and moving target tracking method applying same
CN114092820B (en) * 2022-01-20 2022-04-22 城云科技(中国)有限公司 Target detection method and moving target tracking method applying same
CN114724012A (en) * 2022-06-10 2022-07-08 天津大学 Tropical unstable wave early warning method and device based on spatio-temporal cross-scale attention fusion
CN114724012B (en) * 2022-06-10 2022-08-23 天津大学 Tropical unstable wave early warning method and device based on space-time cross-scale attention fusion
CN115063691B (en) * 2022-07-04 2024-04-12 西安邮电大学 Feature enhancement-based small target detection method in complex scene
CN115063691A (en) * 2022-07-04 2022-09-16 西安邮电大学 Small target detection method based on feature enhancement under complex scene
CN115439765A (en) * 2022-09-17 2022-12-06 艾迪恩(山东)科技有限公司 Marine plastic garbage rotation detection method based on machine learning unmanned aerial vehicle visual angle
CN115439765B (en) * 2022-09-17 2024-02-02 艾迪恩(山东)科技有限公司 Marine plastic garbage rotation detection method based on machine learning unmanned aerial vehicle visual angle
CN115424230B (en) * 2022-09-23 2023-06-06 哈尔滨市科佳通用机电股份有限公司 Method for detecting failure of vehicle door pulley derailment track, storage medium and device
CN115424230A (en) * 2022-09-23 2022-12-02 哈尔滨市科佳通用机电股份有限公司 Fault detection method for vehicle door pulley out-of-track, storage medium and equipment
CN115273154A (en) * 2022-09-26 2022-11-01 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Thermal infrared pedestrian detection method and system based on edge reconstruction and storage medium
CN115908952A (en) * 2023-01-07 2023-04-04 石家庄铁道大学 High-speed rail tunnel fixture detection method based on improved YOLOv5 algorithm
CN115908952B (en) * 2023-01-07 2023-05-19 石家庄铁道大学 High-speed railway tunnel fixture detection method based on improved YOLOv5 algorithm
CN117649633A (en) * 2024-01-30 2024-03-05 武汉纺织大学 Pavement pothole detection method for highway inspection

Also Published As

Publication number Publication date
CN112733749B (en) 2022-04-12

Similar Documents

Publication Publication Date Title
CN112733749B (en) Real-time pedestrian detection method integrating attention mechanism
Huang et al. Multi-scale feature fusion convolutional neural network for indoor small target detection
Tian et al. A dual neural network for object detection in UAV images
CN109934285B (en) Deep learning-based image classification neural network compression model system
Wang et al. Object detection using clustering algorithm adaptive searching regions in aerial images
CN108921198A (en) commodity image classification method, server and system based on deep learning
CN112541532B (en) Target detection method based on dense connection structure
CN112070713A (en) Multi-scale target detection method introducing attention mechanism
Tian et al. Small object detection via dual inspection mechanism for UAV visual images
Yuan Face detection and recognition based on visual attention mechanism guidance model in unrestricted posture
CN113449573A (en) Dynamic gesture recognition method and device
CN111898432A (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
Raparthi et al. Machine Learning Based Deep Cloud Model to Enhance Robustness and Noise Interference
Zhao et al. Fire smoke detection based on target-awareness and depthwise convolutions
CN115294563A (en) 3D point cloud analysis method and device based on Transformer and capable of enhancing local semantic learning ability
CN115375781A (en) Data processing method and device
CN115222998A (en) Image classification method
Fan et al. A novel sonar target detection and classification algorithm
Hu et al. Supervised multi-scale attention-guided ship detection in optical remote sensing images
CN112132207A (en) Target detection neural network construction method based on multi-branch feature mapping
CN111582057A (en) Face verification method based on local receptive field
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
Zhao et al. Multi-scale attention-based feature pyramid networks for object detection
Thirumaladevi et al. Multilayer feature fusion using covariance for remote sensing scene classification
Xiao et al. Optimization methods of video images processing for mobile object recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant