CN111680619A - Pedestrian detection method based on convolutional neural network and dual-attention mechanism - Google Patents

Pedestrian detection method based on convolutional neural network and dual-attention mechanism

Info

Publication number
CN111680619A
CN111680619A (application CN202010506077.1A)
Authority
CN
China
Prior art keywords
attention
channel
map
feature map
double
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010506077.1A
Other languages
Chinese (zh)
Inventor
周东生
张运波
易鹏飞
杨鑫
张强
魏小鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University filed Critical Dalian University
Priority to CN202010506077.1A priority Critical patent/CN111680619A/en
Publication of CN111680619A publication Critical patent/CN111680619A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention provides a pedestrian detection method based on a convolutional neural network and a dual-attention mechanism, comprising the following steps: inputting images from the Caltech dataset and the CityPersons dataset; extracting image features with a convolutional neural network based on the dual-attention mechanism as the backbone network, while a detection head classifies and regresses the features; and marking the detected pedestrians with bounding boxes. The invention provides a lightweight dual-attention modeling method that not only models the relationships between feature channels but also improves the expressive power of the feature map at the pixel level. The invention constructs a single-stage pedestrian detector, CSANet, based on the dual-attention mechanism, and further analyzes the factors influencing its performance through experiments. CSANet achieves state-of-the-art performance on the Caltech benchmark and competitive performance on the CityPersons benchmark while maintaining computational efficiency.

Description

Pedestrian detection method based on convolutional neural network and dual-attention mechanism
Technical Field
The invention relates to the technical field of pedestrian detection in computer vision, in particular to a pedestrian detection method based on a convolutional neural network and an attention mechanism.
Background
Pedestrian detection plays a crucial role in computer vision tasks such as autonomous driving, robotics and surveillance. With the rise of deep learning, pedestrian detectors have made great progress in recent years. However, current state-of-the-art pedestrian detectors are still far from reaching a cognitive level as fast and accurate as humans. Mainstream pedestrian detectors tend to benefit directly from convolutional neural networks (CNNs) designed for image classification. Although CNNs use large downsampling factors to generate high-level semantic features, they cannot adaptively focus on the useful channels and regions of the feature map, which limits further improvement of pedestrian detection performance.
Notably, pedestrians in traffic scenes have characteristics different from general objects, such as diverse backgrounds and multiple scales. Typically, researchers employ deep models to abstract the high-level semantics of object instances, which helps identify pedestrians in traffic scenes. Unfortunately, this approach filters out the location information of many small-scale as well as large-scale pedestrians. Due to the inherent nature of CNNs, critical channels cannot be highlighted and critical spatial locations cannot be emphasized. Convolution is a local operation that obtains local information of an image by applying a convolution kernel to a local image patch; this local operation prevents the CNN from capturing the image from a global view. Therefore, designing an effective backbone for pedestrian detection remains a difficult task.
Disclosure of Invention
In accordance with the technical problem set forth above, a pedestrian detection method based on a convolutional neural network and an attention mechanism is provided. The invention mainly relates to a pedestrian detection method based on a convolutional neural network and a dual-attention mechanism, characterized by comprising the following steps:
step S1: input images from the Caltech [1] and CityPersons [2] datasets;
step S2: use a convolutional neural network based on the dual-attention mechanism as the backbone network to extract image features, and let the detection head classify and regress the features;
step S3: mark the detected pedestrians with bounding boxes.
Further, the mathematical modeling process of the convolutional neural network based on the dual-attention mechanism mainly comprises the following steps:
step S21: given the output of the residual block
F ∈ ℝ^{C×H×W}, defined as the original feature map; the CAM and the SAM are derived in turn to obtain a 1D channel attention map and a 2D spatial attention map; the original feature map is sequentially recalibrated by the two attention maps as:
F_C = M_C ⊕ F (9);
F_S = M_S ⊕ F_C (10);
where ⊕ denotes per-pixel addition, M_C denotes the channel attention map, M_S denotes the spatial attention map; F_C denotes the feature map calibrated by M_C, and F_S denotes the feature map calibrated by M_S;
step S22: compress each 2D feature map into a real number along the spatial axes of the original feature map; aggregate the original feature map F ∈ ℝ^{C×H×W} by a global pooling operation to obtain the channel attention descriptor U ∈ ℝ^{C×1×1}:
u_c = GAP(f_c) = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} f_c(i, j) (11);
where GAP denotes the global average pooling operation, f_c denotes the c-th channel of the original feature map F, and u_c denotes the c-th real number of the channel descriptor U;
step S23: feed U ∈ ℝ^{C×1×1} into two fully-connected layers, and obtain the final channel attention map M_C ∈ ℝ^{C×1×1} through a sigmoid activation function:
M_C = σ(W_2 δ(W_1 U)) (12);
where δ denotes the ReLU activation function, σ denotes the sigmoid activation, and W denotes the scaling parameters of the fully-connected layers, with W_1 ∈ ℝ^{(C/r)×C} and W_2 ∈ ℝ^{C×(C/r)}; r is the compression ratio, set to 16;
step S24: use M_C to recalibrate the input feature map F ∈ ℝ^{C×H×W}; first, M_C is broadcast to an attention map M'_C ∈ ℝ^{C×H×W} with the same dimensions as F, then M'_C is applied to the feature map F by a per-pixel addition operation to obtain the calibrated feature map F_C:
f'_c = F_add(m'_c, f_c) = m'_c + f_c (13);
where F_add denotes channel-wise addition, m'_c is the c-th channel of the feature map M'_C, and f_c is the c-th channel of the original feature map F; f'_c denotes the c-th channel of F_C, and m'_c carries the global information of the c-th channel of M'_C;
step S25: to compute an effective spatial attention map, compress the 3D feature map into a single 2D feature channel along the channel axis of the feature map; given the feature map F_C ∈ ℝ^{C×H×W} calibrated by the channel attention map, obtain the feature map V ∈ ℝ^{1×H×W} by an average pooling operation:
v_ij = AP(f_ij) = (1/C) Σ_{c=1}^{C} f'_c(i, j) (14);
where AP denotes the average pooling operation, f_ij is the pixel value at point (i, j) of f'_c, C is the number of feature channels, and v_ij is the pixel value at point (i, j) of the feature map V;
step S26: convolve the feature map V with a convolution layer with a kernel size of 7 × 7 and a stride of 1, and then obtain the spatial attention map M_S ∈ ℝ^{1×H×W} through a sigmoid activation function:
M_S = σ(f^{7×7}(V)) (15);
where σ denotes the sigmoid activation function and f^{7×7} denotes a convolution operation with a 7 × 7 kernel;
step S27: use M_S to recalibrate the feature map F_C; first, M_S is broadcast to M'_S ∈ ℝ^{C×H×W}, a feature map with the same dimensions as F_C; then:
f'_s = F_add(m'_s, f_s) = m'_s + f_s (16);
where F_add denotes spatial addition, m'_s is the s-th channel of M'_S, f_s is the s-th channel of F_C, and f'_s is the s-th channel of F_S.
Further, the dual attention modules are organized in a serial (sequential) manner, additive operations are used when broadcasting the attention maps, and the dual-attention mechanism is embedded into the convolutional layers of ResNet-50.
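For illustration only, the following minimal PyTorch-style sketch (not part of the disclosure; tensor shapes and names are assumptions) shows how a 1D channel attention map and a 2D spatial attention map are broadcast and added to a feature map in sequence, as in the two recalibration formulas above:

```python
import torch

# Illustrative shapes only: C channels, H x W spatial resolution.
C, H, W = 256, 32, 64
F = torch.randn(C, H, W)     # original feature map F
M_c = torch.rand(C, 1, 1)    # 1D channel attention map: one weight per channel
M_s = torch.rand(1, H, W)    # 2D spatial attention map: one weight per pixel

# Per-pixel additive broadcast, first channel then spatial recalibration:
F_c = F + M_c                # M_c is broadcast over H and W, then added
F_s = F_c + M_s              # M_s is broadcast over C, then added
print(F_s.shape)             # torch.Size([256, 32, 64])
```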
Compared with the prior art, the invention has the following advantages:
1. The dual-attention modeling method is lightweight; it not only models the relationships between feature channels but also improves the expressive power of the feature map at the pixel level.
2. A single-stage pedestrian detector, CSANet, based on the dual-attention mechanism is constructed, and the factors influencing its performance are further analyzed through experiments.
3. CSANet achieves state-of-the-art performance on the Caltech benchmark and competitive performance on the CityPersons benchmark while maintaining computational efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 shows the overall architecture of CSANet according to the present invention. It mainly comprises two parts, a backbone network module and a detection head. An example of a dual attention module embedded in ResNet-50 is shown in the dashed box; the module combines a channel attention module (CAM) and a spatial attention module (SAM) in sequence.
FIG. 2 is the network structure of the channel attention module and the spatial attention module of the present invention. Feature maps are annotated with their dimensions, e.g. H × W × C for height H, width W and C channels; ⊕ denotes the element-wise addition broadcast operation.
FIG. 3 shows the comparison of CSANet of the present invention with other advanced pedestrian detector models at the threshold IoU = 0.5, shown as miss rate versus FPPI curves.
FIG. 4 shows the comparison of CSANet of the present invention with other advanced pedestrian detector models at the threshold IoU = 0.75, shown as miss rate versus FPPI curves.
FIG. 5 is a model visualization of the present invention comparing a visualization using a dual attention model with a visualization without a dual attention model.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in the figures, the invention provides a pedestrian detection method based on a convolutional neural network and a dual-attention mechanism, characterized by comprising the following steps:
S1: input images from the Caltech [1] and CityPersons [2] datasets;
S2: use a convolutional neural network based on the dual-attention mechanism as the backbone network to extract image features, and let the detection head classify and regress the features;
S3: mark the detected pedestrians with bounding boxes.
The overall framework of CSANet is shown in FIG. 1; the backbone network is a ResNet-50 embedded with dual attention modules. The detection head module mainly comprises three convolution layers, which respectively predict the center position, scale and offset of pedestrians. ResNet-50 is divided into 5 stages, and the output feature maps of stages 2 to 5 are defined as P2, P3, P4 and P5, respectively, downsampled by factors of 4, 8, 16 and 16 with respect to the input. The low-level feature maps provide more accurate location information, while the deeper feature maps contain more semantic information. The multi-scale feature maps of the stages are simply concatenated to obtain a fused feature map; before fusion, a deconvolution operation is used to unify the resolution of the output feature maps of each stage. In general, shallow features are generic features, while the semantic information expressed by each channel of the deep features is category-specific.
Taking the third residual block of stage 5 as an example, the process by which the dual attention network broadcasts the attention maps is as follows:
Given the output of the residual block F ∈ ℝ^{C×H×W}, defined as the original feature map, the CAM and the SAM are derived in turn to obtain a 1D channel attention map and a 2D spatial attention map. The original feature map is sequentially recalibrated by the two attention maps. The computation of the two calibrated feature maps can be summarized as:
F_C = M_C ⊕ F (17)
F_S = M_S ⊕ F_C (18)
Here ⊕ denotes per-pixel addition, M_C is the channel attention map, and M_S is the spatial attention map. F_C is the feature map calibrated by M_C, and F_S is the feature map calibrated by M_S.
Fig. 2(a) shows the network structure of the channel attention module. To compute an effective channel attention map, each 2D feature map is compressed into a real number along its spatial axes. The original feature map F ∈ ℝ^{C×H×W} is first aggregated using a global average pooling operation to obtain the channel descriptor U ∈ ℝ^{C×1×1}. The whole calculation process is:
u_c = GAP(f_c) = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} f_c(i, j) (19)
Here, GAP denotes the global average pooling operation, f_c denotes the c-th channel of the original feature map F, and u_c denotes the c-th real number of the channel descriptor U.
U ∈ ℝ^{C×1×1} is then fed into two fully-connected layers, and the final channel attention map M_C ∈ ℝ^{C×1×1} is obtained through a sigmoid activation function. The whole calculation process is expressed as:
M_C = σ(W_2 δ(W_1 U)), (20)
Here, the two fully-connected layers are used to better fit the complex correlation between channels, δ denotes the ReLU activation function, σ denotes the sigmoid activation, and W denotes the scaling parameters of the fully-connected layers, with W_1 ∈ ℝ^{(C/r)×C} and W_2 ∈ ℝ^{C×(C/r)}; r is the compression ratio and is set to 16.
Finally, M_C is used to recalibrate the input feature map F ∈ ℝ^{C×H×W}. In the first step, M_C is broadcast to an attention map M'_C ∈ ℝ^{C×H×W} with the same dimensions as F; then M'_C is applied to the feature map F by a per-pixel addition operation to obtain the calibrated feature map F_C. The whole calculation process is:
f'_c = F_add(m'_c, f_c) = m'_c + f_c. (21)
Here, F_add denotes channel-wise addition, m'_c is the c-th channel of the feature map M'_C, and f_c is the c-th channel of the original feature map F. f'_c is the c-th channel of F_C, and m'_c carries the global information of the c-th channel of M'_C.
Fig. 2(b) shows the network structure of the spatial attention module. To compute an effective spatial attention map, the 3D feature map is compressed into a single 2D feature channel along the channel axis. Given the feature map F_C ∈ ℝ^{C×H×W} calibrated by the channel attention map, the feature map V ∈ ℝ^{1×H×W} is obtained using an average pooling operation. The calculation process is:
v_ij = AP(f_ij) = (1/C) Σ_{c=1}^{C} f'_c(i, j) (22)
Here, AP denotes the average pooling operation, f_ij is the pixel value at point (i, j) of f'_c, C is the number of feature channels, and v_ij is the pixel value at point (i, j) of the feature map V.
Then, the feature map V is convolved with a convolution layer with a kernel size of 7 × 7 and a stride of 1, and the spatial attention map M_S ∈ ℝ^{1×H×W} is obtained with a sigmoid activation function. The calculation process is:
M_S = σ(f^{7×7}(V)), (23)
where σ denotes the sigmoid activation function and f^{7×7} denotes a convolution operation with a 7 × 7 kernel.
Finally, M_S is used to recalibrate the feature map F_C. In the first step, M_S is broadcast to M'_S ∈ ℝ^{C×H×W}, a feature map with the same dimensions as F_C. The whole calculation process is:
f'_s = F_add(m'_s, f_s) = m'_s + f_s. (24)
Here, F_add denotes spatial addition, m'_s is the s-th channel of M'_S, f_s is the s-th channel of F_C, and f'_s is the s-th channel of F_S.
The spatial attention map integrates the global context information of the feature map. Moreover, it attends to the global information of each channel, enlarges the effective receptive field, and allows the CNN to capture image information from a global perspective.
The channel attention module and the spatial attention module may be embedded in the ResNet-50 in a parallel or serial fashion. The channel attention module focuses on important channels, and the spatial attention module focuses on important regions of the feature map. The correct combination of the two attention modules may maximize the effectiveness of the attention mechanism.
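Under the serial CAM-first arrangement found to work best in Table 4, the two modules can be chained as in the following hedged sketch, which reuses the ChannelAttention and SpatialAttention sketches above; the wiring is an assumption consistent with that setting:

```python
import torch.nn as nn

class DualAttentionBlock(nn.Module):
    """Sketch: apply channel attention, then spatial attention, to the output
    of a residual block (serial CAM + SAM arrangement)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.cam = ChannelAttention(channels, reduction)
        self.sam = SpatialAttention(kernel_size=7)

    def forward(self, f):
        f_c = self.cam(f)     # recalibrate important channels first
        f_s = self.sam(f_c)   # then recalibrate important spatial locations
        return f_s
```

In this sketch the block simply wraps the output of a chosen residual block, e.g. `out = DualAttentionBlock(2048)(residual_output)` for the last stage of ResNet-50.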
Table 1 discusses how the fused feature map is generated in the ablation experiments. The Extraction methods column gives the extraction method of each feature map; e.g., stage2-5 indicates that the dual attention network is embedded in stages 2, 3, 4 and 5 of ResNet-50 [3]. The Channel column gives the number of channels of the backbone output feature map in the corresponding model. Parameters gives the number of parameters of the corresponding CSANet model. Test time gives the time required to test one image. Bold font indicates the best result in the corresponding column.
TABLE 1
(Table 1 appears as an image in the original document and cannot be reproduced here.)
Table 2 discusses the feature map fusion method in the ablation experiments. In the Feature maps column, P2, P3, P4 and P5 respectively denote the output feature maps of stages 2, 3, 4 and 5 of ResNet-50. The Channel column gives the number of channels of the backbone output feature map in the corresponding model. Parameters gives the number of parameters of the corresponding CSANet model. Test time gives the time required to test one image. Bold font indicates the best result in the corresponding column.
TABLE 2
(Table 2 appears as an image in the original document and cannot be reproduced here.)
Table 3 discusses additive versus multiplicative broadcasting in the ablation experiments. The Description column identifies the corresponding model. For example, p3p4p5+add denotes fusing the feature maps P3, P4 and P5 and broadcasting the attention maps by addition; p3p4p5+multiply denotes fusing the feature maps P3, P4 and P5 and broadcasting the attention maps by multiplication. The Channel column gives the number of channels of the backbone output feature map in the corresponding model. Parameters gives the number of parameters of the corresponding CSANet model. Test time gives the time required to test one image. Bold font indicates the best result in the corresponding column.
TABLE 3
(Table 3 appears as an image in the original document and cannot be reproduced here.)
Table 4 discusses how the channel and spatial attention modules are connected in the ablation experiments. The Description column identifies the corresponding model. CAM+SAM means that the channel attention module (CAM) is followed by the spatial attention module (SAM) in series. CAM//SAM means that CAM and SAM are connected in parallel. SAM+CAM means that the spatial attention module (SAM) is followed by the channel attention module (CAM) in series. The Channel column gives the number of channels of the backbone output feature map in the corresponding model. Parameters gives the number of parameters of the corresponding CSANet model. Test time gives the time required to test one image. Bold font indicates the best result in the corresponding column.
TABLE 4
(Table 4 appears as an image in the original document and cannot be reproduced here.)
Table 5 compares state-of-the-art detectors on CityPersons at IoU = 0.5. The Hardware column gives the GPU used for network training, while the Scale column represents the number of GPUs. Bold numbers indicate the best results.
TABLE 5
(Table 5 appears as an image in the original document and cannot be reproduced here.)
Example:
results and analysis of the experiments
(1) Ablation experiment
As shown in Table 1, the stage3-5 embedding method achieves the best performance of 3.88% MR^-2 at IoU = 0.5. At IoU = 0.75, the stage2-5 embedding method improves performance by about 36% compared with the stage5 embedding method. Notably, at IoU = 0.5, stage2-4 and stage5 have comparable performance, 4.28% MR^-2 and 4.27% MR^-2 respectively. However, at the threshold of IoU = 0.75, the performance gap between them is large, at 4.77% MR^-2. This comparison shows that the dual attention network is more favorable for regressing high-quality bounding boxes.
As can be seen from Table 2, the models that fuse only low-level features have poor accuracy, but fewer parameters and a higher detection speed. As the number of fused feature maps increases, the detection accuracy of the model improves. At IoU = 0.5, the model that fuses only the two shallower feature maps is the least accurate, and the fully fused model improves on it significantly, by about 47%. At IoU = 0.75, the fusion of P3, P4 and P5 gives the best result. In general, deeper features are helpful for pedestrian detection, but they occupy more memory.
As can be seen from Table 3, the models using additive broadcasting are better than those using multiplicative broadcasting. The MR^-2 difference between p3p4p5+add and p3p4p5+multiply is about 37%; in the second and third groups of experiments, the difference is about 17%. Furthermore, the broadcast method of the attention maps hardly affects the test time; in practice, the runtime is mainly determined by the model parameters.
In fact, multiplication is computationally more complex than addition. Although multiplication enhances the useful information in the feature map, it also over-amplifies the effect of noise, and the multiplicative weighting operation unduly suppresses some contextual details, which is detrimental to locating pedestrians. Besides accuracy, the pedestrian detection task also needs to consider the real-time performance of the model, and multiplication increases the running time of the network to some extent.
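The two broadcast variants compared in Table 3 differ only in the broadcast operator, as the following tiny sketch illustrates (tensor shapes and values are arbitrary assumptions):

```python
import torch

f = torch.randn(1, 256, 32, 64)     # feature map
m_c = torch.rand(1, 256, 1, 1)      # channel attention map with values in [0, 1]

f_add = f + m_c   # additive broadcast: shifts activations, preserves context details
f_mul = f * m_c   # multiplicative broadcast: rescales activations, which can suppress
                  # low-weighted context and amplify noise on highly weighted channels
```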
As shown in Table 4, CAM+SAM indicates that the channel attention module and the spatial attention module are connected sequentially. The analysis shows that the sequential arrangement yields better results than the parallel arrangement. The CAM-first mode gives the best result, 3.88% MR^-2. CAM//SAM indicates that the two modules are arranged in parallel, with performance 0.27% MR^-2 lower than CAM+SAM. In the third mode, SAM+CAM, the SAM comes first in the dual attention network, and this model has the worst performance, 4.57% MR^-2.
(2) Visualization experiment
As shown in Fig. 5, the experimental results of the model are visualized. The visualization algorithm Grad-CAM [4] is used to qualitatively interpret the CSANet model, which improves the interpretability of the CNN to a certain degree. The algorithm derives a class activation map that can be used to locate the regions of a class in the image. Grad-CAM mainly uses the gradients of the last convolutional layer of the network to generate a heatmap that highlights the important pixels in the input image.
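A minimal, self-contained sketch of the Grad-CAM idea (the hook usage, layer choice and normalization are assumptions; published Grad-CAM implementations differ in detail):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, last_conv, image, target_index):
    """Sketch: weight the last conv layer's activations by the spatial mean of
    their gradients w.r.t. the target score, then ReLU and upsample to a heatmap.
    `image` is assumed to be an N x C x H x W tensor and `model(image)` a score tensor."""
    acts, grads = {}, {}
    h1 = last_conv.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = last_conv.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    score = model(image).flatten()[target_index]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)        # GAP of gradients per channel
    cam = torch.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-8)                            # normalized heatmap
```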
As can be seen from Fig. 5, the heatmap of the model with the dual attention network covers the pedestrian regions better than that of the model without the dual attention network. In other words, the dual attention network focuses better on the pixel information of the target region. The visualization results qualitatively show that the pixel expression capability of the target region is enhanced to a certain extent in the improved feature maps.
(3) Comparative experiments on SOTA
Fig. 3 (IoU = 0.5) and Fig. 4 (IoU = 0.75) compare the performance of advanced pedestrian detection algorithms on the Caltech dataset. The algorithms include DeepParts [5], MS-CNN [6], SA-FasterRCNN [7], RPN+BF [8], FasterRCNN+ATT [9], SDS-RCNN [10], AdaptFasterRCNN [2], CSP+City [11], CSANet [ours] and CSANet+City [ours].
As shown in Fig. 3, the model initialized from the CityPersons dataset has the best performance: compared with the current high-performance method CSP+City, CSANet+City achieves the best result of 3.55% MR^-2. The CSANet model initialized from the ImageNet dataset achieves 3.88% MR^-2, a significant improvement over the baseline model. As shown in Fig. 4, the CSANet model also achieves a lower miss rate under the tighter threshold setting, which means that the dual attention network also helps improve the quality of the bounding boxes.
Table 5 compares advanced pedestrian detectors on the CityPersons dataset. This set of experiments uses IoU = 0.5, and only a single NVIDIA GTX 1080Ti GPU with mini-batch = 2 was used to train the network. Table 5 shows that the CSANet detector achieves a state-of-the-art 7.25% MR^-2 on the Bare subset of CityPersons. On the Reasonable subset, CSANet is second only to the CSP model [11] trained with mini-batch = 8. The CSANet detector is comparable to ALFNet [12], and improves by about 2.6% over the competing RepLoss [13]. In fact, a larger batch size, within reasonable limits, makes the gradient descent direction more accurate.
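The MR^-2 values quoted above are log-average miss rates; the following hedged sketch shows how such a value is typically computed over the FPPI range [10^-2, 10^0] following Dollar et al. [1] (the sampling and interpolation details are assumptions):

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Sketch: geometric mean of the miss rate sampled at num_points log-spaced
    FPPI values in [1e-2, 1e0] (the MR^-2 metric). Assumes `fppi` is sorted in
    ascending order with `miss_rate` giving the matching miss-rate curve."""
    refs = np.logspace(-2.0, 0.0, num_points)
    sampled = []
    for r in refs:
        idx = np.where(fppi <= r)[0]
        # Miss rate at the largest FPPI not exceeding the reference point,
        # falling back to the worst miss rate when the curve starts above it.
        sampled.append(miss_rate[idx[-1]] if idx.size else miss_rate.max())
    return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))
```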
References
[1] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," IEEE Trans. Pattern Anal. Mach. Intell. (PAMI), vol. 34, no. 4, pp. 743-761, Apr. 2012.
[2] S. Zhang, R. Benenson, and B. Schiele, "Citypersons: A diverse dataset for pedestrian detection," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3213-3221.
[3] K. He, X. Zhang, S. Ren, et al., "Deep Residual Learning for Image Recognition," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), June 2016, pp. 770-778.
[4] R. R. Selvaraju, M. Cogswell, et al., "Grad-cam: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 618-626.
[5] Y. Tian, P. Luo, X. Wang, et al., "Deep learning strong parts for pedestrian detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1904-1912.
[6] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, "A unified multi-scale deep convolutional neural network for fast object detection," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 354-370.
[7] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan, "Scale-aware fast R-CNN for pedestrian detection," IEEE Trans. Multimedia, vol. 20, no. 4, pp. 985-996, Apr. 2017.
[8] L. Zhang, L. Lin, X. Liang, and K. He, "Is faster R-CNN doing well for pedestrian detection?," in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2016, pp. 443-457.
[9] S. Zhang, J. Yang, and B. Schiele, "Occluded pedestrian detection through guided attention in CNNs," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 6995-7003.
[10] G. Brazil, X. Yin, X. Liu, "Illuminating pedestrians via simultaneous detection & segmentation," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 4950-4959.
[11] W. Liu, S. Liao, W. Ren, W. Hu, and Y. Yu, "High-level Semantic Feature Detection: A New Perspective for Pedestrian Detection," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5187-5196.
[12] W. Liu, S. Liao, W. Hu, X. Liang, and X. Chen, "Learning efficient single-stage pedestrian detectors by asymptotic localization fitting," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 618-634.
[13] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen, "Repulsion loss: Detecting pedestrians in a crowd," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7774-7783.
[14] T. Song, L. Sun, D. Xie, H. Sun, and S. Pu, "Small-scale pedestrian detection based on topological line localization and temporal feature aggregation," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 536-551.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (3)

1. The pedestrian detection method based on the convolutional neural network and the dual-attention mechanism, characterized by comprising the following steps:
S1: inputting images from the Caltech dataset and the CityPersons dataset;
S2: using a convolutional neural network based on the dual-attention mechanism as the backbone network to extract image features, with a detection head classifying and regressing the features;
S3: marking the detected pedestrians with bounding boxes.
2. The convolutional neural network and dual-attention mechanism-based pedestrian detection method of claim 1, wherein:
the mathematical modeling process of the convolutional neural network based on the dual-attention mechanism mainly comprises the following steps:
s21: given the output of the residual block
F ∈ ℝ^{C×H×W}, defined as the original feature map; the CAM and the SAM are derived in turn to obtain a 1D channel attention map and a 2D spatial attention map; the original feature map is sequentially recalibrated by the two attention maps as:
F_C = M_C ⊕ F (1);
F_S = M_S ⊕ F_C (2);
wherein ⊕ denotes per-pixel addition, M_C denotes the channel attention map, M_S denotes the spatial attention map; F_C denotes the feature map calibrated by M_C, and F_S denotes the feature map calibrated by M_S;
S22: compressing each 2D feature map into a real number along the spatial axes of the original feature map; aggregating the original feature map F ∈ ℝ^{C×H×W} by a global pooling operation to obtain the channel attention descriptor U ∈ ℝ^{C×1×1}:
u_c = GAP(f_c) = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} f_c(i, j) (3);
wherein GAP denotes the global average pooling operation, f_c denotes the c-th channel of the original feature map F, and u_c denotes the c-th real number of the channel descriptor U;
S23: feeding U ∈ ℝ^{C×1×1} into two fully-connected layers, and obtaining the final channel attention map M_C ∈ ℝ^{C×1×1} through a sigmoid activation function:
M_C = σ(W_2 δ(W_1 U)) (4);
wherein δ denotes the ReLU activation function, σ denotes the sigmoid activation, and W denotes the scaling parameters of the fully-connected layers, with W_1 ∈ ℝ^{(C/r)×C} and W_2 ∈ ℝ^{C×(C/r)}; r is the compression ratio, set to 16;
S24: using M_C to recalibrate the input feature map F ∈ ℝ^{C×H×W}; firstly, M_C is broadcast to an attention map M'_C ∈ ℝ^{C×H×W} with the same dimensions as F, and then M'_C is applied to the feature map F by a per-pixel addition operation to obtain the calibrated feature map F_C:
f'_c = F_add(m'_c, f_c) = m'_c + f_c (5);
wherein F_add denotes channel-wise addition, m'_c is the c-th channel of the feature map M'_C, and f_c is the c-th channel of the original feature map F; f'_c denotes the c-th channel of F_C, and m'_c carries the global information of the c-th channel of M'_C;
S25: in order to compute an effective spatial attention map, compressing the 3D feature map into a single 2D feature channel along the channel axis of the feature map; given the feature map F_C ∈ ℝ^{C×H×W} calibrated by the channel attention map, obtaining the feature map V ∈ ℝ^{1×H×W} by an average pooling operation:
v_ij = AP(f_ij) = (1/C) Σ_{c=1}^{C} f'_c(i, j) (6);
wherein AP denotes the average pooling operation, f_ij is the pixel value at point (i, j) of f'_c, C is the number of feature channels, and v_ij is the pixel value at point (i, j) of the feature map V;
S26: convolving the feature map V with a convolution layer with a kernel size of 7 × 7 and a stride of 1, and then obtaining the spatial attention map M_S ∈ ℝ^{1×H×W} through a sigmoid activation function:
M_S = σ(f^{7×7}(V)) (7);
wherein σ denotes the sigmoid activation function and f^{7×7} denotes a convolution operation with a 7 × 7 kernel;
S27: using M_S to recalibrate the feature map F_C; firstly, M_S is broadcast to M'_S ∈ ℝ^{C×H×W}, a feature map with the same dimensions as F_C; then:
f'_s = F_add(m'_s, f_s) = m'_s + f_s (8);
wherein F_add denotes spatial addition, m'_s is the s-th channel of M'_S, f_s is the s-th channel of F_C, and f'_s is the s-th channel of F_S.
3. The convolutional neural network and dual-attention mechanism-based pedestrian detection method of claim 1, further characterized in that:
the dual attention modules are organized in a serial (sequential) manner, additive operations are used when broadcasting the attention maps, and the dual-attention mechanism is embedded into multiple convolutional layers of ResNet-50.
CN202010506077.1A 2020-06-05 2020-06-05 Pedestrian detection method based on convolutional neural network and double-attention machine mechanism Pending CN111680619A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010506077.1A CN111680619A (en) 2020-06-05 2020-06-05 Pedestrian detection method based on convolutional neural network and double-attention machine mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010506077.1A CN111680619A (en) 2020-06-05 2020-06-05 Pedestrian detection method based on convolutional neural network and double-attention machine mechanism

Publications (1)

Publication Number Publication Date
CN111680619A true CN111680619A (en) 2020-09-18

Family

ID=72434993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010506077.1A Pending CN111680619A (en) 2020-06-05 2020-06-05 Pedestrian detection method based on convolutional neural network and double-attention machine mechanism

Country Status (1)

Country Link
CN (1) CN111680619A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135243A (en) * 2019-04-02 2019-08-16 上海交通大学 A kind of pedestrian detection method and system based on two-stage attention mechanism
CN110675406A (en) * 2019-09-16 2020-01-10 南京信息工程大学 CT image kidney segmentation algorithm based on residual double-attention depth network
CN110991362A (en) * 2019-12-06 2020-04-10 西安电子科技大学 Pedestrian detection model based on attention mechanism
CN111160628A (en) * 2019-12-13 2020-05-15 重庆邮电大学 Air pollutant concentration prediction method based on CNN and double-attention seq2seq

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUNBO ZHANG et al.: "CSANet: Channel and Spatial Mixed Attention CNN for Pedestrian Detection" *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560720A (en) * 2020-12-21 2021-03-26 奥比中光科技集团股份有限公司 Pedestrian identification method and system
CN112800964A (en) * 2021-01-27 2021-05-14 中国人民解放军战略支援部队信息工程大学 Remote sensing image target detection method and system based on multi-module fusion
CN112800964B (en) * 2021-01-27 2021-10-22 中国人民解放军战略支援部队信息工程大学 Remote sensing image target detection method and system based on multi-module fusion
CN113450366A (en) * 2021-07-16 2021-09-28 桂林电子科技大学 AdaptGAN-based low-illumination semantic segmentation method
CN113450366B (en) * 2021-07-16 2022-08-30 桂林电子科技大学 AdaptGAN-based low-illumination semantic segmentation method

Similar Documents

Publication Publication Date Title
Pang et al. Hierarchical dynamic filtering network for RGB-D salient object detection
Jiang et al. Crowd counting and density estimation by trellis encoder-decoder networks
Giraldo et al. Graph moving object segmentation
Tran et al. Deep end2end voxel2voxel prediction
Liao et al. Video-based person re-identification via 3d convolutional networks and non-local attention
CN110188239B (en) Double-current video classification method and device based on cross-mode attention mechanism
Liu et al. Del: Deep embedding learning for efficient image segmentation.
CN111680619A (en) Pedestrian detection method based on convolutional neural network and double-attention machine mechanism
Fang et al. Deep3DSaliency: Deep stereoscopic video saliency detection model by 3D convolutional networks
CN113158723A (en) End-to-end video motion detection positioning system
Duta et al. Histograms of motion gradients for real-time video classification
CN111797841B (en) Visual saliency detection method based on depth residual error network
Chen et al. Dr-tanet: Dynamic receptive temporal attention network for street scene change detection
Xie et al. Context-aware pedestrian detection especially for small-sized instances with Deconvolution Integrated Faster RCNN (DIF R-CNN)
CN111931603A (en) Human body action recognition system and method based on double-current convolution network of competitive combination network
Xia et al. Pedestrian detection algorithm based on multi-scale feature extraction and attention feature fusion
Hou et al. A super-fast deep network for moving object detection
Guo et al. Learning efficient stereo matching network with depth discontinuity aware super-resolution
Wen et al. Deep fusion based video saliency detection
Li et al. Msffa: a multi-scale feature fusion and attention mechanism network for crowd counting
Islam et al. Representation for action recognition with motion vector termed as: SDQIO
Liu et al. Hypergraph attentional convolutional neural network for salient object detection
Geng et al. Adaptive multi-level graph convolution with contrastive learning for skeleton-based action recognition
Basavaiah et al. Robust Feature Extraction and Classification Based Automated Human Action Recognition System for Multiple Datasets.
Byvshev et al. Are 3D convolutional networks inherently biased towards appearance?

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200918