CN113673593A - Multi-scale feature fusion pedestrian detection method based on attention mechanism - Google Patents
- Publication number
- CN113673593A (application number CN202110941462.3A)
- Authority
- CN
- China
- Prior art keywords
- attention
- model
- map
- pedestrian detection
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
The invention discloses a multi-scale feature fusion pedestrian detection method based on an attention mechanism, comprising the following steps: inputting a training set and a verification set, extracting pedestrian features, and generating feature maps; feeding them into the network model and training it; and, once training reaches the specified number of batches, outputting the model and verifying it. The method applies the FCOS algorithm to pedestrian detection and, on that basis, adopts a dense pyramid structure that fuses top-level and bottom-level features, so that the fused features carry both the spatial information of the bottom levels and the detail information of the top-level features, allowing pedestrian targets to be recognized more reliably. The fused features are further refined with spatial and channel attention, so that pedestrian targets can be localized accurately.
Description
Technical Field
The invention belongs to the technical field of pedestrian detection, and particularly relates to a multi-scale feature fusion pedestrian detection method based on an attention mechanism.
Background
In the twenty-first century, science and technology in China have advanced rapidly, bringing great convenience to daily life. More and more complex work can be completed by computers, so that not every task must be performed by hand. In the past, nearly every job relied on human labor — farming, for example — which was costly in time, money, and effort. Today, thanks to computing and big data, high-quality results can be obtained without occupying public resources, saving a great deal of time. Computer vision is one such field attracting many researchers, and it is applied in areas such as pattern recognition, image processing, and machine learning.
The fully convolutional one-stage detector (FCOS) realizes proposal-free and anchor-free detection and introduces the notion of center-ness, which reduces the weight of bounding boxes far from an object's center; low-quality boxes are thereby suppressed and detection quality is improved. FCOS has consequently become a common model for pedestrian detection. With the rapid development of pedestrian detection technology, adding a dense pyramid structure and an attention mechanism to existing network structures has shown great potential for improving both pedestrian detection and overall model performance.
A CNN extracts feature information from the image at multiple resolutions. Lower-level features carry rich spatial information and are suited to detecting small targets; conversely, top-level features carry rich semantic information and are suited to detecting large targets. The invention therefore changes the feature pyramid to dense connections at the CNN stage, so that the fused features possess stronger semantic and spatial information. In addition, an effective attention mechanism is key to improving the model's pedestrian detection precision. A single attention mechanism applied to a model sometimes does not work well, so a CBAM module is used: CBAM is a simple and effective feed-forward convolutional attention module that can be integrated seamlessly into any CNN architecture. Applying spatial and channel attention in a deep neural network is thus a meaningful direction for optimizing the performance of the FCOS model.
In 2019, Yuan Beijiang et al. proposed a method for detecting small-scale dense pedestrians. Candidate boxes with different aspect ratios are compared against pedestrian ground-truth boxes by intersection over union (IoU): a box is labeled a negative sample when the ratio is below 0.3 and a positive sample when it is above 0.7. Feature maps with different semantics are taken from three stages of the network and matched to targets of different scales — high-semantic maps to large targets, medium-semantic maps to medium targets, and low-semantic maps to small targets. The semantic levels are cascaded, with low-semantic feature maps successively superimposed onto higher-semantic ones; finally, several convolution layers output a mask prediction over the feature map, and a loss is computed against Gaussian mask labels, thereby improving detection of small-scale targets.
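The IoU labeling rule described above (negative below 0.3, positive above 0.7) can be sketched as follows. The box format, thresholds as parameters, and function names are illustrative choices, not taken from the patent:

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes are [x1, y1, x2, y2]; returns intersection over union.
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_box(candidate, gt, lo=0.3, hi=0.7):
    # Label a candidate box against one ground-truth pedestrian box:
    # 0 = negative sample, 1 = positive sample, -1 = ignored.
    v = iou(candidate, gt)
    if v < lo:
        return 0
    if v > hi:
        return 1
    return -1
```

Boxes whose IoU falls between the two thresholds are ignored during training rather than forced into either class.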
In 2020, Bai Xiaying et al. proposed a pedestrian detection model based on the attention mechanism. It uses the YOLOv3 backbone and incorporates channel and spatial attention into the residual connections, so that richer features can be extracted. Building on the real-time detection performance of YOLOv3, the method reduces missed and false detections in end-to-end pedestrian detection, makes full use of and corrects the extracted feature vectors, and modifies the single connection mode in the residual structure so that the network better screens out feature vectors useful for subsequent detection.
At present, conventional pedestrian detection methods are gradually being replaced by deep learning. Conventional methods rely on hand-designed features that struggle to adapt to changes in human body shape. Researchers have found that features learned by deep networks have strong hierarchical expressive power and good robustness, and can better solve certain vision problems.
Because deep convolutional networks perform well in pedestrian detection, more and more work adapts general object detection frameworks to the task. Most such frameworks, however, are anchor-based: in scenes with few pedestrians, they spend considerable time and memory generating many invalid anchor boxes that contain no pedestrian. To let the model focus better on pedestrians, spatial and channel attention mechanisms have shown great potential; yet many existing methods pursue better classification by building ever more complicated attention modules, continually increasing computational complexity.
The pedestrian detection methods above are all effective to some degree, but they set anchor boxes with too many aspect ratios when matching against pedestrian ground-truth boxes. In scenes with few pedestrians, such an anchor mechanism causes the model to generate a large number of anchor boxes containing no target, which increases the model's computational load while consuming a great deal of memory.
Disclosure of Invention
In view of the defects of the prior art, the technical problem solved by the invention is to provide a multi-scale feature fusion pedestrian detection method based on an attention mechanism that greatly reduces the model's computation time, changes the fusion mode of the feature pyramid so that the features of each layer are fused more fully, and further integrates a CBAM module on this basis, so that the pedestrian position is attended to more directly and detection is more accurate.
In order to solve the technical problem, the invention provides a multi-scale feature fusion pedestrian detection method based on an attention mechanism, which comprises the following steps of:
step 1: inputting a training set and a verification set, extracting pedestrian features and generating a feature map;
step 2: inputting a network model and training the model;
and step 3: and if the batch reaches the specified batch, outputting the model and verifying the model.
Optionally, in step 1, the input feature map is passed through a channel attention module to compute a channel attention map, and the input feature map and the channel attention map are multiplied pixel-wise to obtain a salient feature map; a two-dimensional spatial attention map is then computed by a spatial attention module, and finally the salient feature map and the spatial attention map are multiplied pixel-wise to output the final salient feature map.
Optionally, the channel attention module first applies max pooling and average pooling to the feature map, generating two different spatial content description feature maps; both are fed simultaneously into a shared network to compute the channel attention map, and finally the shared network's output vectors are fused by pixel-wise addition.
From the above, the attention-mechanism-based multi-scale feature fusion pedestrian detection method mainly addresses low detection precision and missed detections when pedestrians are small or occluded in a scene. The FCOS algorithm is applied to pedestrian detection, and on that basis a dense pyramid structure fuses top-level and bottom-level features, so that the fused features carry both bottom-level spatial information and top-level detail information and pedestrian targets can be better identified. The fused features are further refined with spatial and channel attention, so that pedestrian targets can be localized accurately.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical means of the present invention more clearly understood, the present invention may be implemented in accordance with the content of the description, and in order to make the above and other objects, features, and advantages of the present invention more clearly understood, the following detailed description is given in conjunction with the preferred embodiments, together with the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments will be briefly described below.
FIG. 1 is a flowchart of a multi-scale feature fusion pedestrian detection method based on an attention mechanism according to the present invention.
Detailed Description
Other aspects, features and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which form a part of this specification, and which illustrate, by way of example, the principles of the invention. In the referenced drawings, the same or similar components in different drawings are denoted by the same reference numerals.
As shown in FIG. 1, the multi-scale feature fusion pedestrian detection method based on the attention mechanism comprises the following steps:
step 1: inputting a training set and a verification set, extracting pedestrian features and generating a feature map;
step 2: inputting a network model and training the model;
and step 3: and if the batch reaches the specified batch, outputting the model and verifying the model.
Given an intermediate CNN feature map F ∈ R^(C×H×W) as the input of CBAM, where C is the number of channels and W and H are the width and height of each channel map. CBAM first passes the input feature map F through the channel attention module to compute a channel attention map A_c ∈ R^(C×1×1); F and A_c are multiplied pixel-wise to obtain the salient feature map F_c ∈ R^(C×H×W). F_c is then passed through the spatial attention module to compute a two-dimensional spatial attention map A_s ∈ R^(1×H×W); finally F_c and A_s are multiplied pixel-wise to output the final salient feature map F_R ∈ R^(C×H×W).
The channel attention module first applies max pooling and average pooling to the feature map F, generating two different spatial content descriptors, F_max^c and F_avg^c. Both descriptors are sent simultaneously into a shared network — a multilayer perceptron with one hidden layer — and the shared network's two output vectors are fused by pixel-wise addition to obtain the channel attention map A_c.
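The channel attention computation just described can be sketched as follows. This is a minimal NumPy illustration, assuming a ReLU hidden layer and a sigmoid on the summed outputs (standard for CBAM-style modules, though the patent does not state the activations); weight shapes and names are illustrative:

```python
import numpy as np

def channel_attention(F, W1, W2):
    # F: feature map of shape (C, H, W).
    # W1: (C//r, C), W2: (C, C//r) -- the shared one-hidden-layer MLP.
    f_max = F.max(axis=(1, 2))   # (C,) max-pooled channel descriptor
    f_avg = F.mean(axis=(1, 2))  # (C,) average-pooled channel descriptor

    def mlp(v):
        # Shared MLP with a ReLU hidden layer.
        return W2 @ np.maximum(W1 @ v, 0.0)

    a = mlp(f_max) + mlp(f_avg)          # fuse by element-wise addition
    A_c = 1.0 / (1.0 + np.exp(-a))       # sigmoid -> channel attention map
    return F * A_c[:, None, None]        # rescale each channel of F
```

With both weight matrices zero, the sigmoid yields 0.5 everywhere, so the output is exactly half the input — a quick sanity check of the wiring.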
Similarly, the spatial attention module compresses the channel dimension by applying average pooling and max pooling along the channel axis. The MaxPool operation takes the maximum across channels at each position, yielding height × width values; the AvgPool operation likewise takes the average across channels at each position. The two resulting single-channel maps are then concatenated to obtain a 2-channel feature map.
Next, the CBAM attention module is fused into the dense feature pyramid so that the feature maps can be corrected layer by layer. Besides the lateral connections and upsampling connections among the pyramid levels {P2, P3, P4, P5}, the top-level features are also connected to every lower level, so that the fused features carry both the semantic information of several top levels and the spatial information of the lower levels, which better improves detection precision. However, redundant features still exist after dense connection, so CBAM is applied to obtain the salient feature maps {A2, A3, A4, A5} of each layer, and finally the lateral and dense connections adopted by each layer's feature maps are joined.
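The dense connection pattern — every lower level receiving all higher levels, not just its immediate neighbour as in a standard FPN — can be sketched as follows. Nearest-neighbour upsampling and element-wise addition are assumptions (the patent does not specify the resampling or fusion operator):

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def dense_fuse(pyramid):
    # pyramid: [P2, P3, P4, P5], each (C, H, W), spatial size
    # halving at every level. Each output level sums its own map
    # with every higher (more semantic) level, upsampled to match.
    fused = []
    for i, p in enumerate(pyramid):
        acc = p.copy()
        for j in range(i + 1, len(pyramid)):
            q = pyramid[j]
            for _ in range(j - i):   # upsample to level i's resolution
                q = upsample2x(q)
            acc = acc + q
        fused.append(acc)
    return fused
```

In a full model, each fused map would then pass through CBAM to produce the salient maps {A2, A3, A4, A5}.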
In this embodiment, Table 1 compares the method of the invention with the two-stage model Faster R-CNN, the one-stage model RetinaNet, and a recent method. All four methods use the same Caltech pedestrian dataset, which verifies the improvement in pedestrian detection precision achieved by the method. Average precision (AP) is used as the evaluation metric.
Table 1 comparison of detection performance on Caltech dataset by different methods
The invention adopts a one-stage fully convolutional, anchor-free model: on the basis of an FCN, each coordinate point directly predicts the distances to the top, bottom, left, and right sides of the corresponding target, which increases the number of detected positive samples. Compared with existing methods, this avoids, on the one hand, the increased computation time caused by the large number of target boxes generated in anchor-based methods, as well as the poor generalization that results when anchors cannot cover all targets; on the other hand, thanks to the dense connections and the integrated CBAM attention module, the extracted features represent pedestrians better, the pedestrian region can be located quickly and accurately, and detection precision is improved.
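The center-ness weighting that FCOS uses to suppress low-quality boxes (mentioned in the background) can be sketched from the published FCOS formula; this follows the original paper, and the patent does not restate the formula itself:

```python
import math

def centerness(l, t, r, b):
    # l, t, r, b: distances from a location to the left, top, right,
    # and bottom sides of its target box. The score is 1.0 at the
    # box centre and decays toward 0 near the border, down-weighting
    # predictions far from the centre.
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```

A location exactly at the centre (equal distances on both axes) scores 1.0; a location near the left edge, where l is much smaller than r, is strongly down-weighted.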
While the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
Claims (3)
1. A multi-scale feature fusion pedestrian detection method based on an attention mechanism is characterized by comprising the following steps:
step 1: inputting a training set and a verification set, extracting pedestrian features and generating a feature map;
step 2: inputting a network model and training the model;
and step 3: and if the batch reaches the specified batch, outputting the model and verifying the model.
2. The attention-mechanism-based multi-scale feature fusion pedestrian detection method of claim 1, wherein in step 1 the input feature map is passed through a channel attention module to compute a channel attention map, and the input feature map and the channel attention map are multiplied pixel-wise to obtain a salient feature map; a two-dimensional spatial attention map is then computed by a spatial attention module, and finally the salient feature map and the spatial attention map are multiplied pixel-wise to output the final salient feature map.
3. The attention-mechanism-based multi-scale feature fusion pedestrian detection method of claim 2, wherein the channel attention module first applies max pooling and average pooling to the feature map, generating two different spatial content description feature maps; both are fed simultaneously into a shared network to compute the channel attention map, and finally the shared network's output vectors are fused by pixel-wise addition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110941462.3A CN113673593A (en) | 2021-08-17 | 2021-08-17 | Multi-scale feature fusion pedestrian detection method based on attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110941462.3A CN113673593A (en) | 2021-08-17 | 2021-08-17 | Multi-scale feature fusion pedestrian detection method based on attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113673593A true CN113673593A (en) | 2021-11-19 |
Family
ID=78543201
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110941462.3A Pending CN113673593A (en) | 2021-08-17 | 2021-08-17 | Multi-scale feature fusion pedestrian detection method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113673593A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110084210A (en) * | 2019-04-30 | 2019-08-02 | 电子科技大学 | The multiple dimensioned Ship Detection of SAR image based on attention pyramid network |
CN110705457A (en) * | 2019-09-29 | 2020-01-17 | 核工业北京地质研究院 | Remote sensing image building change detection method |
CN111767882A (en) * | 2020-07-06 | 2020-10-13 | 江南大学 | Multi-mode pedestrian detection method based on improved YOLO model |
- 2021-08-17: application CN202110941462.3A filed in China; publication CN113673593A; status Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111126472B (en) | SSD (solid State disk) -based improved target detection method | |
CN109241982B (en) | Target detection method based on deep and shallow layer convolutional neural network | |
CN111259786B (en) | Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video | |
CN111126202A (en) | Optical remote sensing image target detection method based on void feature pyramid network | |
WO2021218786A1 (en) | Data processing system, object detection method and apparatus thereof | |
CN111461083A (en) | Rapid vehicle detection method based on deep learning | |
CN112070044B (en) | Video object classification method and device | |
CN113609896A (en) | Object-level remote sensing change detection method and system based on dual-correlation attention | |
CN111353544B (en) | Improved Mixed Pooling-YOLOV 3-based target detection method | |
CN112801027A (en) | Vehicle target detection method based on event camera | |
CN111008633A (en) | License plate character segmentation method based on attention mechanism | |
CN113920468B (en) | Multi-branch pedestrian detection method based on cross-scale feature enhancement | |
Liu et al. | Extended faster R-CNN for long distance human detection: Finding pedestrians in UAV images | |
CN112036260A (en) | Expression recognition method and system for multi-scale sub-block aggregation in natural environment | |
CN113297959A (en) | Target tracking method and system based on corner attention twin network | |
Wang et al. | TF-SOD: a novel transformer framework for salient object detection | |
Wang et al. | Global contextual guided residual attention network for salient object detection | |
CN110852199A (en) | Foreground extraction method based on double-frame coding and decoding model | |
CN114913604A (en) | Attitude identification method based on two-stage pooling S2E module | |
Wei et al. | Bidirectional attentional interaction networks for rgb-d salient object detection | |
Wang et al. | Summary of object detection based on convolutional neural network | |
CN114758285B (en) | Video interaction action detection method based on anchor freedom and long-term attention perception | |
CN113673593A (en) | Multi-scale feature fusion pedestrian detection method based on attention mechanism | |
CN113076902B (en) | Multitasking fusion character fine granularity segmentation system and method | |
CN114998879A (en) | Fuzzy license plate recognition method based on event camera |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||