CN113673593A - Multi-scale feature fusion pedestrian detection method based on attention mechanism - Google Patents

Multi-scale feature fusion pedestrian detection method based on attention mechanism


Publication number
CN113673593A
Authority
CN
China
Prior art keywords
attention
model
map
pedestrian detection
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110941462.3A
Other languages
Chinese (zh)
Inventor
曲海成
夏明豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Technical University
Original Assignee
Liaoning Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Technical University
Priority to CN202110941462.3A
Publication of CN113673593A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Traffic Control Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale feature fusion pedestrian detection method based on an attention mechanism, comprising the following steps: inputting a training set and a validation set, extracting pedestrian features, and generating a feature map; feeding the result into the network model and training the model; and, when training reaches the specified number of batches, outputting the model and validating it. The method applies the FCOS algorithm to pedestrian detection and, on that basis, adopts a dense pyramid structure that fuses top-layer and bottom-layer features, so that the fused features carry both the spatial information of the bottom layers and the semantic information of the top layers, and pedestrian targets can be identified better. In addition, spatial and channel attention is applied to the fused features, so that pedestrian targets can be located accurately.

Description

Multi-scale feature fusion pedestrian detection method based on attention mechanism
Technical Field
The invention belongs to the technical field of pedestrian detection, and particularly relates to a multi-scale feature fusion pedestrian detection method based on an attention mechanism.
Background
Science and technology in China have advanced rapidly in the twenty-first century, bringing great convenience to daily life. Computers can now complete more and more complex work, so not every task has to be carried out by people in person. In the past, every job, farming for example, relied on human labor and was costly in time, money, and effort. Today, thanks to computers and big data, high-quality results can be obtained without occupying public resources and with a great saving of time. Computer vision is a field that a large share of researchers study, and it draws on disciplines such as pattern recognition, image processing, and machine learning.
FCOS (Fully Convolutional One-Stage object detection) is proposal-free and anchor-free and introduces the notion of center-ness, which lowers the weight of bounding boxes predicted far from the center of a target, thereby suppressing low-quality bounding boxes and improving detection quality. It has therefore become a common model for pedestrian detection. With the rapid development of pedestrian detection technology, adding a dense pyramid structure and an attention mechanism to existing network structures has shown great potential for detecting pedestrians better and improving model performance.
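The center-ness idea can be illustrated with a small sketch. The formula below is the one commonly given for FCOS, stated here as background rather than taken from this patent; the function name `centerness` and the sample distances are illustrative assumptions.

```python
import math

def centerness(l, t, r, b):
    """Center-ness of a location, given its distances to the left, top,
    right, and bottom sides of the target box (FCOS-style definition)."""
    if min(l, t, r, b) <= 0:
        return 0.0  # the location lies on or outside the box
    horizontal = min(l, r) / max(l, r)
    vertical = min(t, b) / max(t, b)
    return math.sqrt(horizontal * vertical)

# A location at the exact center scores 1; an off-center location scores lower.
center_score = centerness(10, 10, 10, 10)
off_center_score = centerness(2, 10, 18, 10)
print(center_score, off_center_score)
```

Multiplying a classification score by this value is one way to down-weight boxes predicted from locations far from the target center.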
A CNN extracts feature information from an image at multiple resolutions. Lower-layer features carry rich spatial information and are suited to detecting small targets; top-layer features, by contrast, carry rich semantic information and are suited to detecting large targets. The invention therefore makes the feature pyramid densely connected at the CNN stage, so that the fused features carry stronger semantic and spatial information. In addition, an effective attention mechanism is key to improving the model's pedestrian detection accuracy. A single attention mechanism applied to a model sometimes does not work well, so a CBAM module is needed; CBAM is a simple and effective attention module for feed-forward convolutional neural networks and can be integrated seamlessly into any CNN architecture. Applying spatial and channel attention to the deep neural network is therefore a meaningful way to optimize the performance of the FCOS model.
In 2019, Yuanbeijiang et al. proposed a method for detecting small-scale, dense pedestrians. Boxes with different aspect ratios are set and their intersection over union (IoU) with the ground-truth pedestrian boxes is calculated; a box is marked as a negative sample when the IoU is below 0.3 and as a positive sample when it is above 0.7. Feature maps with different semantic levels are taken from three stages of the network model and matched to targets of different scales: the high-semantic feature map to large-scale targets, the medium-semantic feature map to medium-scale targets, and the low-semantic feature map to small-scale targets. The feature maps are also cascaded, with the low-semantic feature map superimposed upward onto the high-level semantic feature map. Finally, several convolution layers output a mask prediction feature map, and a loss is computed against Gaussian mask labels, which improves detection of small-scale targets.
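The IoU-based labeling rule described above can be sketched as follows. The thresholds 0.3 and 0.7 come from the text; the `(x1, y1, x2, y2)` box format, the function names, and the choice to ignore boxes whose IoU falls between the two thresholds are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Return 1 (positive), 0 (negative), or -1 (ignored, an assumption)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > pos_thr:
        return 1
    if best < neg_thr:
        return 0
    return -1  # between the thresholds: the source does not say what happens

gt = [(0, 0, 10, 10)]
labels = [label_anchor(a, gt) for a in [(0, 0, 10, 10), (20, 20, 30, 30), (0, 0, 10, 5)]]
print(labels)
```

Every candidate box must be scored this way against every ground-truth box, which is the per-anchor cost that the anchor-free approach discussed later avoids.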
In 2020, white xiaying et al. proposed a pedestrian detection model based on the attention mechanism that uses the main framework of YOLOv3 and incorporates channel attention and spatial attention mechanisms through residual connections. Using the YOLOv3 structure allows richer features to be extracted. Building on the real-time detection performance of the YOLOv3 algorithm, that method reduces missed and false detections in end-to-end pedestrian detection, makes full use of and corrects the extracted feature vectors, and modifies the single connection mode of the residual connection structure, so that the whole network can better select the feature vectors that benefit subsequent detection.
At present, conventional pedestrian detection methods are gradually being replaced by deep learning methods. Conventional methods use hand-crafted features that adapt poorly to changes in human body shape. Researchers have found that features learned by deep learning have strong hierarchical expressive power and good robustness and can better solve some vision problems.
Because deep convolutional networks work well for pedestrian detection, more and more researchers improve object detection frameworks and apply them to pedestrian detection. However, most frameworks are anchor-based, and in scenes with few pedestrians anchor-based methods spend a lot of time and memory generating many invalid anchor boxes that contain no pedestrian. Spatial and channel attention mechanisms have shown great potential for making the model focus better on pedestrians. However, most existing methods develop ever more complex attention modules to achieve better classification, which keeps increasing computational complexity.
The pedestrian detection methods above are all effective to some degree, but they set anchor boxes with too many aspect ratios for comparison with the ground-truth pedestrian boxes, and in scenes with few pedestrians such an anchor mechanism makes the model generate a large number of anchor boxes that contain no target. This increases the computational load of the model and consumes a lot of memory.
Disclosure of Invention
In view of the shortcomings of the prior art, the technical problem addressed by the invention is to provide a multi-scale feature fusion pedestrian detection method based on an attention mechanism that greatly reduces the model's computation time, changes how the feature pyramid is fused so that the features of each layer are fused more fully, and further integrates a CBAM module so that the model focuses more directly on pedestrian positions and detects them better and more accurately.
To solve this technical problem, the invention provides a multi-scale feature fusion pedestrian detection method based on an attention mechanism, comprising the following steps:
step 1: inputting a training set and a validation set, extracting pedestrian features, and generating a feature map;
step 2: feeding the result into the network model and training the model;
step 3: when training reaches the specified number of batches, outputting the model and validating it.
Optionally, in step 1, the input feature map is passed through a channel attention module to obtain a unique channel attention map, and the input feature map is multiplied by the channel attention map at the pixel level to obtain a salient feature map; a spatial attention module then computes a two-dimensional spatial attention map, and finally the salient feature map is multiplied by the spatial attention map at the pixel level to output the final salient feature map.
Optionally, the channel attention first applies max pooling and average pooling to the feature map, generating two different spatial-content descriptor feature maps; both descriptors are fed into a shared network to compute the channel attention map, and the output vectors of the shared network are finally fused by element-wise (pixel-by-pixel) addition.
As can be seen from the above, the method mainly addresses low detection accuracy and missed detections when pedestrians in a scene are small or occluded by objects. The FCOS algorithm is applied to pedestrian detection and, on that basis, a dense pyramid structure fuses top-layer and bottom-layer features, so that the fused features carry both bottom-layer spatial information and top-layer semantic information, and pedestrian targets can be identified better. In addition, spatial and channel attention is applied to the fused features, so that pedestrian targets can be located accurately.
The foregoing is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features, and advantages can be understood more readily, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments will be briefly described below.
FIG. 1 is a flowchart of a multi-scale feature fusion pedestrian detection method based on an attention mechanism according to the present invention.
Detailed Description
Other aspects, features and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which form a part of this specification, and which illustrate, by way of example, the principles of the invention. In the referenced drawings, the same or similar components in different drawings are denoted by the same reference numerals.
As shown in FIG. 1, the multi-scale feature fusion pedestrian detection method based on the attention mechanism comprises the following steps:
step 1: inputting a training set and a validation set, extracting pedestrian features, and generating a feature map;
step 2: feeding the result into the network model and training the model;
step 3: when training reaches the specified number of batches, outputting the model and validating it.
An intermediate-layer feature map $F \in \mathbb{R}^{C \times H \times W}$ of the CNN is given as the input feature map of CBAM, where $C$ is the number of channels and $W$ and $H$ are the width and height of each channel's feature map. CBAM first passes the input feature map $F$ through a channel attention module to obtain a unique channel attention map $A_c \in \mathbb{R}^{C \times 1 \times 1}$; multiplying $F$ and $A_c$ at the pixel level gives the salient feature map $F_c \in \mathbb{R}^{C \times H \times W}$. A spatial attention module then computes a two-dimensional spatial attention map $A_s \in \mathbb{R}^{1 \times H \times W}$ from $F_c$; finally, multiplying $F_c$ and $A_s$ at the pixel level outputs the final salient feature map $F_R \in \mathbb{R}^{C \times H \times W}$.
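The shapes in this two-stage refinement can be checked with a short NumPy sketch. The attention maps here are random stand-ins, not computed attention, and the sizes are arbitrary assumptions; the point is only how a C×1×1 map and a 1×H×W map broadcast against a C×H×W feature map.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 6, 5

F = rng.standard_normal((C, H, W))   # input feature map
A_c = rng.uniform(size=(C, 1, 1))    # stand-in channel attention map
F_c = A_c * F                        # one weight per channel, broadcast over H and W

A_s = rng.uniform(size=(1, H, W))    # stand-in spatial attention map
F_R = A_s * F_c                      # one weight per position, broadcast over C

print(F_c.shape, F_R.shape)
```

Each output element is the input element scaled by its channel weight and its position weight, so the output keeps the shape of the input.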
The channel attention first applies max pooling and average pooling to the feature map $F$, generating two different spatial-content descriptor feature maps, [formula image BDA0003215161900000051] and [formula image BDA0003215161900000052]. Both descriptors are fed into a shared network to compute the channel attention map $A_c$. The shared network is a multilayer perceptron with one hidden layer, and its output vectors are finally fused by element-wise (pixel-by-pixel) addition to obtain $A_c$.
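A minimal NumPy sketch of this channel attention step follows. The pooling, the shared one-hidden-layer perceptron, and the element-wise addition follow the description; the ReLU in the hidden layer, the final sigmoid, the reduction ratio of 4, and the random weights are assumptions borrowed from the usual CBAM formulation, since the text does not specify them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """Return a channel attention map of shape (C, 1, 1).

    F  : feature map, shape (C, H, W)
    W1 : hidden-layer weights of the shared MLP, shape (C_hidden, C)
    W2 : output-layer weights of the shared MLP, shape (C, C_hidden)
    """
    avg_desc = F.mean(axis=(1, 2))  # average pooling over H and W -> (C,)
    max_desc = F.max(axis=(1, 2))   # max pooling over H and W -> (C,)

    def shared_mlp(v):
        # One hidden layer; ReLU is an assumption, the source names no activation.
        return W2 @ np.maximum(W1 @ v, 0.0)

    # Fuse the two MLP outputs by element-wise addition, then squash to (0, 1).
    fused = shared_mlp(avg_desc) + shared_mlp(max_desc)
    return sigmoid(fused).reshape(-1, 1, 1)

rng = np.random.default_rng(0)
C, H, W = 8, 5, 4
F = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C // 4, C)) * 0.1   # hypothetical reduction ratio of 4
W2 = rng.standard_normal((C, C // 4)) * 0.1
A_c = channel_attention(F, W1, W2)
F_c = A_c * F                                  # pixel-level multiplication
print(A_c.shape, F_c.shape)
```

Because the same weights process both descriptors, the module adds few parameters regardless of the spatial size of the feature map.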
Similarly, the spatial attention module compresses the channels, applying mean pooling and max pooling along the channel dimension. MaxPool takes the maximum value across channels at each spatial position, so it is evaluated height × width times; AvgPool takes the average across channels at each position and is likewise evaluated height × width times. The two resulting feature maps, each with one channel, are then concatenated into a 2-channel feature map.
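The channel-wise pooling and concatenation can be sketched as below. The description stops at the 2-channel map; reducing it to a single-channel attention map with a k×k convolution followed by a sigmoid is an assumption taken from the usual CBAM design, and the 7×7 random kernel is purely illustrative.

```python
import numpy as np

def spatial_attention(F_c, kernel):
    """Return a spatial attention map of shape (1, H, W).

    F_c    : feature map, shape (C, H, W)
    kernel : convolution weights, shape (2, k, k) with odd k (assumed step)
    """
    avg_map = F_c.mean(axis=0)            # average across channels -> (H, W)
    max_map = F_c.max(axis=0)             # maximum across channels -> (H, W)
    pooled = np.stack([avg_map, max_map])  # 2-channel feature map -> (2, H, W)

    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(pooled, ((0, 0), (p, p), (p, p)))  # zero padding keeps H x W
    H, W = avg_map.shape
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return (1.0 / (1.0 + np.exp(-out)))[None]  # sigmoid, then add channel axis

rng = np.random.default_rng(0)
F_c = rng.standard_normal((8, 6, 5))
kernel = rng.standard_normal((2, 7, 7)) * 0.05   # hypothetical 7x7 kernel
A_s = spatial_attention(F_c, kernel)
F_R = A_s * F_c                                   # pixel-level multiplication
print(A_s.shape, F_R.shape)
```

The explicit loops keep the sketch dependency-free; a real implementation would use a framework convolution instead.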
Next, the CBAM attention module is fused into a dense feature pyramid so that the feature maps can be corrected layer by layer. In addition to the lateral connections and upsampling connections on the feature pyramid {P2, P3, P4, P5}, the top-level features are also connected to the lower-level features, so that the fused features carry the semantic information of several top-level features as well as the spatial information of the lower-level features, which improves detection accuracy. Redundant features still remain after dense connection, however, so CBAM is applied to the fused features to obtain the salient feature maps {A2, A3, A4, A5} of each layer, and finally the feature maps of each layer are connected through the lateral and dense connections.
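One possible reading of the dense connection is sketched below: each pyramid level receives its own lateral feature plus every higher level upsampled to its resolution. This is an interpretation, not the patent's confirmed wiring. It assumes all levels already share a channel count (lateral 1×1 convolutions are omitted), each level is half the size of the one below it, upsampling is nearest-neighbor, and fusion is summation; the function names are hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def dense_top_down(levels):
    """Fuse pyramid levels ordered from finest (P2) to coarsest (P5).

    Each output level is its own input plus all coarser inputs, upsampled
    to the same resolution and summed (assumed fusion rule).
    """
    fused = []
    n = len(levels)
    for i in range(n):
        out = levels[i].copy()
        for j in range(i + 1, n):
            out += upsample_nearest(levels[j], 2 ** (j - i))
        fused.append(out)
    return fused

# Four toy levels with 2 channels and sizes 32, 16, 8, 4.
levels = [np.ones((2, 32 >> i, 32 >> i)) for i in range(4)]
fused = dense_top_down(levels)
print([f.shape for f in fused])
```

Under this reading, a CBAM block would then be applied to each fused level to produce the salient maps {A2, A3, A4, A5}.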
In the embodiment, Table 1 shows the results of comparative experiments between the method of the invention, the two-stage model Faster R-CNN, the single-stage model RetinaNet, and the latest method. All four methods use the same Caltech pedestrian dataset, so the improvement in pedestrian detection accuracy achieved by the method can be verified. The invention uses average precision (AP) as the evaluation metric.
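For readers unfamiliar with the metric, AP can be computed from ranked detections as the area under the precision-recall curve. The text does not say which AP variant was used, so the all-point interpolation below is an assumption, and the scores and match flags are made-up illustrative values, not results from the patent.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve with all-point interpolation.

    scores : confidence of each detection
    is_tp  : 1 if the detection matches a ground-truth box, else 0
    num_gt : total number of ground-truth boxes
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)

    # Pad the curve, then make precision monotonically non-increasing.
    mrec = np.concatenate([[0.0], recall, [1.0]])
    mpre = np.concatenate([[0.0], precision, [0.0]])
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]

    # Sum rectangle areas wherever recall changes.
    idx = np.nonzero(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

ap = average_precision(scores=[0.9, 0.8, 0.7], is_tp=[1, 0, 1], num_gt=2)
print(ap)
```

Deciding which detections count as true positives requires a matching rule, typically an IoU threshold against the ground-truth boxes, which is outside this sketch.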
Table 1 comparison of detection performance on Caltech dataset by different methods
Figure BDA0003215161900000061
The invention adopts a one-stage fully convolutional model that is anchor-free: building on an FCN, each location directly predicts its distances to the top, bottom, left, and right sides of the corresponding target, which increases the number of positive samples. Compared with existing methods, this on the one hand avoids two problems of anchor-based methods: generating a large number of candidate boxes, which increases model computation time, and anchors that can hardly cover all targets, which weakens generalization. On the other hand, because dense connections and the CBAM attention module are integrated, the extracted features represent pedestrians better, pedestrian regions can be located quickly and accurately, and detection accuracy improves.
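The per-location regression described above can be sketched as follows. The `(x1, y1, x2, y2)` box format and the function names are assumptions for illustration; the rule that a location is a positive sample when it falls inside a ground-truth box follows the usual FCOS formulation.

```python
def fcos_targets(x, y, box):
    """Distances (l, t, r, b) from location (x, y) to the sides of a box
    given as (x1, y1, x2, y2), or None if the location is not inside it."""
    x1, y1, x2, y2 = box
    l, t, r, b = x - x1, y - y1, x2 - x, y2 - y
    if min(l, t, r, b) > 0:
        return (l, t, r, b)
    return None  # location is outside the box: not a positive sample

def decode(x, y, ltrb):
    """Recover the box from a location and its predicted distances."""
    l, t, r, b = ltrb
    return (x - l, y - t, x + r, y + b)

box = (0, 0, 10, 20)
targets = fcos_targets(5, 5, box)
print(targets, decode(5, 5, targets))
```

Every location inside a ground-truth box yields a training target, so no anchor boxes need to be generated or matched.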
While the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (3)

1. A multi-scale feature fusion pedestrian detection method based on an attention mechanism is characterized by comprising the following steps:
step 1: inputting a training set and a validation set, extracting pedestrian features, and generating a feature map;
step 2: feeding the result into the network model and training the model;
step 3: when training reaches the specified number of batches, outputting the model and validating it.
2. The multi-scale feature fusion pedestrian detection method based on an attention mechanism according to claim 1, characterized in that, in step 1, the input feature map is passed through a channel attention module to obtain a unique channel attention map, and the input feature map is multiplied by the channel attention map at the pixel level to obtain a salient feature map; a spatial attention module then computes a two-dimensional spatial attention map, and finally the salient feature map is multiplied by the spatial attention map at the pixel level to output the final salient feature map.
3. The multi-scale feature fusion pedestrian detection method based on an attention mechanism according to claim 2, characterized in that the channel attention first applies max pooling and average pooling to the feature map, generating two different spatial-content descriptor feature maps; both descriptors are fed into a shared network to compute the channel attention map, and the output vectors of the shared network are finally fused by element-wise (pixel-by-pixel) addition.
CN202110941462.3A 2021-08-17 2021-08-17 Multi-scale feature fusion pedestrian detection method based on attention mechanism Pending CN113673593A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110941462.3A CN113673593A (en) 2021-08-17 2021-08-17 Multi-scale feature fusion pedestrian detection method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110941462.3A CN113673593A (en) 2021-08-17 2021-08-17 Multi-scale feature fusion pedestrian detection method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN113673593A true CN113673593A (en) 2021-11-19

Family

ID=78543201

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110941462.3A Pending CN113673593A (en) 2021-08-17 2021-08-17 Multi-scale feature fusion pedestrian detection method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN113673593A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084210A (en) * 2019-04-30 2019-08-02 电子科技大学 The multiple dimensioned Ship Detection of SAR image based on attention pyramid network
CN110705457A (en) * 2019-09-29 2020-01-17 核工业北京地质研究院 Remote sensing image building change detection method
CN111767882A (en) * 2020-07-06 2020-10-13 江南大学 Multi-mode pedestrian detection method based on improved YOLO model



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination