CN113673593A - Multi-scale feature fusion pedestrian detection method based on attention mechanism - Google Patents
- Publication number
- CN113673593A (application number CN202110941462.3A)
- Authority
- CN
- China
- Prior art keywords
- attention
- model
- map
- pedestrian detection
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
The invention discloses a multi-scale feature fusion pedestrian detection method based on an attention mechanism, comprising the following steps: inputting a training set and a verification set, extracting pedestrian features, and generating feature maps; feeding them into the network model and training it; and, once training reaches the specified number of batches, outputting the model and verifying it. The method applies the FCOS algorithm to pedestrian detection and, on that basis, adopts a dense pyramid structure that fuses top-level and bottom-level features, so that the fused features carry both the spatial information of the bottom levels and the detail information of the top-level features, allowing pedestrian targets to be recognized more reliably. The fused features are further refined with spatial and channel attention, so that pedestrian targets can be localized accurately.
Description
Technical Field
The invention belongs to the technical field of pedestrian detection, and particularly relates to a multi-scale feature fusion pedestrian detection method based on an attention mechanism.
Background
In the twenty-first century, science and technology in China have advanced rapidly, bringing great convenience to daily life. More and more complex work can be completed by computers, so that not every task must be performed by hand. In the past, nearly every job relied on human labor — farming, for example — which was costly in time, money, and effort. Today, thanks to computing and big data, high-quality results can be obtained without occupying public resources, saving a great deal of time. Computer vision is one such field attracting many researchers, and it is applied in areas such as pattern recognition, image processing, and machine learning.
The fully convolutional one-stage detector (FCOS) realizes proposal-free and anchor-free detection and introduces the notion of center-ness, which reduces the weight of bounding boxes far from an object's center; low-quality boxes are thereby suppressed and detection quality is improved. FCOS has consequently become a common model for pedestrian detection. With the rapid development of pedestrian detection technology, adding a dense pyramid structure and an attention mechanism to existing network structures has shown great potential for improving both pedestrian detection and overall model performance.
A CNN extracts feature information from the image at multiple resolutions. Lower-level features carry rich spatial information and are suited to detecting small targets; conversely, top-level features carry rich semantic information and are suited to detecting large targets. The invention therefore changes the feature pyramid to dense connections at the CNN stage, so that the fused features possess stronger semantic and spatial information. In addition, an effective attention mechanism is key to improving the model's pedestrian detection precision. A single attention mechanism applied to a model sometimes does not work well, so a CBAM module is used: CBAM is a simple and effective feed-forward convolutional attention module that can be integrated seamlessly into any CNN architecture. Applying spatial and channel attention in a deep neural network is thus a meaningful direction for optimizing the performance of the FCOS model.
In 2019, Yuan Beijiang et al. proposed a method for detecting small-scale dense pedestrians. Candidate boxes with different aspect ratios are compared against pedestrian ground-truth boxes by intersection over union (IoU): a box is labeled a negative sample when the ratio is below 0.3 and a positive sample when it is above 0.7. Feature maps with different semantics are taken from three stages of the network and matched to targets of different scales — high-semantic maps to large targets, medium-semantic maps to medium targets, and low-semantic maps to small targets. The semantic levels are cascaded, with low-semantic feature maps successively superimposed onto higher-semantic ones; finally, several convolution layers output a mask prediction over the feature map, and a loss is computed against Gaussian mask labels, thereby improving detection of small-scale targets.
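The IoU labeling rule described above (negative below 0.3, positive above 0.7) can be sketched as follows. The box format, thresholds as parameters, and function names are illustrative choices, not taken from the patent:

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes are [x1, y1, x2, y2]; returns intersection over union.
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_box(candidate, gt, lo=0.3, hi=0.7):
    # Label a candidate box against one ground-truth pedestrian box:
    # 0 = negative sample, 1 = positive sample, -1 = ignored.
    v = iou(candidate, gt)
    if v < lo:
        return 0
    if v > hi:
        return 1
    return -1
```

Boxes whose IoU falls between the two thresholds are ignored during training rather than forced into either class.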
In 2020, Bai Xiaying et al. proposed a pedestrian detection model based on the attention mechanism. It uses the YOLOv3 backbone and incorporates channel and spatial attention into the residual connections, so that richer features can be extracted. Building on the real-time detection performance of YOLOv3, the method reduces missed and false detections in end-to-end pedestrian detection, makes full use of and corrects the extracted feature vectors, and modifies the single connection mode in the residual structure so that the network better screens out feature vectors useful for subsequent detection.
At present, conventional pedestrian detection methods are gradually being replaced by deep learning. Conventional methods rely on hand-designed features that struggle to adapt to changes in human body shape. Researchers have found that features learned by deep networks have strong hierarchical expressive power and good robustness, and can better solve certain vision problems.
Because deep convolutional networks perform well in pedestrian detection, more and more work adapts general object detection frameworks to the task. Most such frameworks, however, are anchor-based: in scenes with few pedestrians, they spend considerable time and memory generating many invalid anchor boxes that contain no pedestrian. To let the model focus better on pedestrians, spatial and channel attention mechanisms have shown great potential; yet many existing methods pursue better classification by building ever more complicated attention modules, continually increasing computational complexity.
The pedestrian detection methods above are all effective to some degree, but they set anchor boxes with too many aspect ratios when matching against pedestrian ground-truth boxes. In scenes with few pedestrians, such an anchor mechanism causes the model to generate a large number of anchor boxes containing no target, which increases the model's computational load while consuming a great deal of memory.
Disclosure of Invention
In view of the defects of the prior art, the technical problem solved by the invention is to provide a multi-scale feature fusion pedestrian detection method based on an attention mechanism that greatly reduces the model's computation time, changes the fusion mode of the feature pyramid so that the features of each layer are fused more fully, and further integrates a CBAM module on this basis, so that the pedestrian position is attended to more directly and detection is more accurate.
In order to solve the technical problem, the invention provides a multi-scale feature fusion pedestrian detection method based on an attention mechanism, which comprises the following steps of:
step 1: inputting a training set and a verification set, extracting pedestrian features and generating a feature map;
step 2: inputting a network model and training the model;
and step 3: and if the batch reaches the specified batch, outputting the model and verifying the model.
Optionally, in step 1, the input feature map is passed through a channel attention module to compute a channel attention map, and the input feature map and the channel attention map are multiplied pixel-wise to obtain a salient feature map; a two-dimensional spatial attention map is then computed by a spatial attention module, and finally the salient feature map and the spatial attention map are multiplied pixel-wise to output the final salient feature map.
Optionally, the channel attention module first applies max pooling and average pooling to the feature map, generating two different spatial content description feature maps; both are fed simultaneously into a shared network to compute the channel attention map, and finally the shared network's output vectors are fused by pixel-wise addition.
From the above, the attention-mechanism-based multi-scale feature fusion pedestrian detection method mainly addresses low detection precision and missed detections when pedestrians are small or occluded in a scene. The FCOS algorithm is applied to pedestrian detection, and on that basis a dense pyramid structure fuses top-level and bottom-level features, so that the fused features carry both bottom-level spatial information and top-level detail information and pedestrian targets can be better identified. The fused features are further refined with spatial and channel attention, so that pedestrian targets can be localized accurately.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical means of the present invention more clearly understood, the present invention may be implemented in accordance with the content of the description, and in order to make the above and other objects, features, and advantages of the present invention more clearly understood, the following detailed description is given in conjunction with the preferred embodiments, together with the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments will be briefly described below.
FIG. 1 is a flowchart of a multi-scale feature fusion pedestrian detection method based on an attention mechanism according to the present invention.
Detailed Description
Other aspects, features and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which form a part of this specification, and which illustrate, by way of example, the principles of the invention. In the referenced drawings, the same or similar components in different drawings are denoted by the same reference numerals.
As shown in FIG. 1, the multi-scale feature fusion pedestrian detection method based on the attention mechanism comprises the following steps:
step 1: inputting a training set and a verification set, extracting pedestrian features and generating a feature map;
step 2: inputting a network model and training the model;
and step 3: and if the batch reaches the specified batch, outputting the model and verifying the model.
Given an intermediate CNN feature map F ∈ R^(C×H×W) as the input of CBAM, where C is the number of channels and W and H are the width and height of each channel map. CBAM first passes the input feature map F through the channel attention module to compute a channel attention map A_c ∈ R^(C×1×1); F and A_c are multiplied pixel-wise to obtain the salient feature map F_c ∈ R^(C×H×W). F_c is then passed through the spatial attention module to compute a two-dimensional spatial attention map A_s ∈ R^(1×H×W); finally F_c and A_s are multiplied pixel-wise to output the final salient feature map F_R ∈ R^(C×H×W).
The channel attention module first applies max pooling and average pooling to the feature map F, generating two different spatial content descriptors, F_max^c and F_avg^c. Both descriptors are sent simultaneously into a shared network — a multilayer perceptron with one hidden layer — and the shared network's two output vectors are fused by pixel-wise addition to obtain the channel attention map A_c.
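The channel attention computation just described can be sketched as follows. This is a minimal NumPy illustration, assuming a ReLU hidden layer and a sigmoid on the summed outputs (standard for CBAM-style modules, though the patent does not state the activations); weight shapes and names are illustrative:

```python
import numpy as np

def channel_attention(F, W1, W2):
    # F: feature map of shape (C, H, W).
    # W1: (C//r, C), W2: (C, C//r) -- the shared one-hidden-layer MLP.
    f_max = F.max(axis=(1, 2))   # (C,) max-pooled channel descriptor
    f_avg = F.mean(axis=(1, 2))  # (C,) average-pooled channel descriptor

    def mlp(v):
        # Shared MLP with a ReLU hidden layer.
        return W2 @ np.maximum(W1 @ v, 0.0)

    a = mlp(f_max) + mlp(f_avg)          # fuse by element-wise addition
    A_c = 1.0 / (1.0 + np.exp(-a))       # sigmoid -> channel attention map
    return F * A_c[:, None, None]        # rescale each channel of F
```

With both weight matrices zero, the sigmoid yields 0.5 everywhere, so the output is exactly half the input — a quick sanity check of the wiring.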
Similarly, the spatial attention module compresses the channel dimension by applying average pooling and max pooling along the channel axis. The MaxPool operation takes the maximum across channels at each position, yielding height × width values; the AvgPool operation likewise takes the average across channels at each position. The two resulting single-channel maps are then concatenated to obtain a 2-channel feature map.
Next, the CBAM attention module is fused into the dense feature pyramid so that the feature maps can be corrected layer by layer. Besides the lateral connections and upsampling connections among the pyramid levels {P2, P3, P4, P5}, the top-level features are also connected to every lower level, so that the fused features carry both the semantic information of several top levels and the spatial information of the lower levels, which better improves detection precision. However, redundant features still exist after dense connection, so CBAM is applied to obtain the salient feature maps {A2, A3, A4, A5} of each layer, and finally the lateral and dense connections adopted by each layer's feature maps are joined.
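The dense connection pattern — every lower level receiving all higher levels, not just its immediate neighbour as in a standard FPN — can be sketched as follows. Nearest-neighbour upsampling and element-wise addition are assumptions (the patent does not specify the resampling or fusion operator):

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def dense_fuse(pyramid):
    # pyramid: [P2, P3, P4, P5], each (C, H, W), spatial size
    # halving at every level. Each output level sums its own map
    # with every higher (more semantic) level, upsampled to match.
    fused = []
    for i, p in enumerate(pyramid):
        acc = p.copy()
        for j in range(i + 1, len(pyramid)):
            q = pyramid[j]
            for _ in range(j - i):   # upsample to level i's resolution
                q = upsample2x(q)
            acc = acc + q
        fused.append(acc)
    return fused
```

In a full model, each fused map would then pass through CBAM to produce the salient maps {A2, A3, A4, A5}.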
In this embodiment, Table 1 compares the method of the invention with the two-stage model Faster R-CNN, the one-stage model RetinaNet, and a recent method. All four methods use the same Caltech pedestrian dataset, which verifies the improvement in pedestrian detection precision achieved by the method. Average precision (AP) is used as the evaluation metric.
Table 1 comparison of detection performance on Caltech dataset by different methods
The invention adopts a one-stage fully convolutional, anchor-free model: on the basis of an FCN, each coordinate point directly predicts the distances to the top, bottom, left, and right sides of the corresponding target, which increases the number of detected positive samples. Compared with existing methods, this avoids, on the one hand, the increased computation time caused by the large number of target boxes generated in anchor-based methods, as well as the poor generalization that results when anchors cannot cover all targets; on the other hand, thanks to the dense connections and the integrated CBAM attention module, the extracted features represent pedestrians better, the pedestrian region can be located quickly and accurately, and detection precision is improved.
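The center-ness weighting that FCOS uses to suppress low-quality boxes (mentioned in the background) can be sketched from the published FCOS formula; this follows the original paper, and the patent does not restate the formula itself:

```python
import math

def centerness(l, t, r, b):
    # l, t, r, b: distances from a location to the left, top, right,
    # and bottom sides of its target box. The score is 1.0 at the
    # box centre and decays toward 0 near the border, down-weighting
    # predictions far from the centre.
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```

A location exactly at the centre (equal distances on both axes) scores 1.0; a location near the left edge, where l is much smaller than r, is strongly down-weighted.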
While the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
Claims (3)
1. A multi-scale feature fusion pedestrian detection method based on an attention mechanism is characterized by comprising the following steps:
step 1: inputting a training set and a verification set, extracting pedestrian features and generating a feature map;
step 2: inputting a network model and training the model;
and step 3: and if the batch reaches the specified batch, outputting the model and verifying the model.
2. The attention-mechanism-based multi-scale feature fusion pedestrian detection method of claim 1, wherein in step 1 the input feature map is passed through a channel attention module to compute a channel attention map, and the input feature map and the channel attention map are multiplied pixel-wise to obtain a salient feature map; a two-dimensional spatial attention map is then computed by a spatial attention module, and finally the salient feature map and the spatial attention map are multiplied pixel-wise to output the final salient feature map.
3. The attention-mechanism-based multi-scale feature fusion pedestrian detection method of claim 2, wherein the channel attention module first applies max pooling and average pooling to the feature map, generating two different spatial content description feature maps; both are fed simultaneously into a shared network to compute the channel attention map, and finally the shared network's output vectors are fused by pixel-wise addition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110941462.3A CN113673593A (en) | 2021-08-17 | 2021-08-17 | Multi-scale feature fusion pedestrian detection method based on attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110941462.3A CN113673593A (en) | 2021-08-17 | 2021-08-17 | Multi-scale feature fusion pedestrian detection method based on attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113673593A true CN113673593A (en) | 2021-11-19 |
Family
ID=78543201
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110941462.3A Pending CN113673593A (en) | 2021-08-17 | 2021-08-17 | Multi-scale feature fusion pedestrian detection method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113673593A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110084210A (en) * | 2019-04-30 | 2019-08-02 | 电子科技大学 | The multiple dimensioned Ship Detection of SAR image based on attention pyramid network |
CN110705457A (en) * | 2019-09-29 | 2020-01-17 | 核工业北京地质研究院 | Remote sensing image building change detection method |
CN111767882A (en) * | 2020-07-06 | 2020-10-13 | 江南大学 | Multi-mode pedestrian detection method based on improved YOLO model |
- 2021-08-17: application CN202110941462.3A filed in China; publication CN113673593A; status Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111126472B (en) | SSD (solid State disk) -based improved target detection method | |
CN109241982B (en) | Target detection method based on deep and shallow layer convolutional neural network | |
CN111259786B (en) | Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video | |
CN111126202A (en) | Optical remote sensing image target detection method based on void feature pyramid network | |
WO2021218786A1 (en) | Data processing system, object detection method and apparatus thereof | |
CN111461083A (en) | Rapid vehicle detection method based on deep learning | |
CN112070044B (en) | Video object classification method and device | |
CN113609896A (en) | Object-level remote sensing change detection method and system based on dual-correlation attention | |
CN111353544B (en) | Improved Mixed Pooling-YOLOV 3-based target detection method | |
CN112801027A (en) | Vehicle target detection method based on event camera | |
CN111008633A (en) | License plate character segmentation method based on attention mechanism | |
CN113920468B (en) | Multi-branch pedestrian detection method based on cross-scale feature enhancement | |
Liu et al. | Extended faster R-CNN for long distance human detection: Finding pedestrians in UAV images | |
CN112036260A (en) | Expression recognition method and system for multi-scale sub-block aggregation in natural environment | |
CN113297959A (en) | Target tracking method and system based on corner attention twin network | |
Wang et al. | TF-SOD: a novel transformer framework for salient object detection | |
Wang et al. | Global contextual guided residual attention network for salient object detection | |
CN110852199A (en) | Foreground extraction method based on double-frame coding and decoding model | |
CN114913604A (en) | Attitude identification method based on two-stage pooling S2E module | |
Wei et al. | Bidirectional attentional interaction networks for rgb-d salient object detection | |
Wang et al. | Summary of object detection based on convolutional neural network | |
CN114758285B (en) | Video interaction action detection method based on anchor freedom and long-term attention perception | |
CN113673593A (en) | Multi-scale feature fusion pedestrian detection method based on attention mechanism | |
CN113076902B (en) | Multitasking fusion character fine granularity segmentation system and method | |
CN114998879A (en) | Fuzzy license plate recognition method based on event camera |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||