CN111680619A - Pedestrian detection method based on convolutional neural network and dual-attention mechanism - Google Patents

Pedestrian detection method based on convolutional neural network and dual-attention mechanism

Info

Publication number
CN111680619A
CN111680619A (application CN202010506077.1A)
Authority
CN
China
Prior art keywords
attention
channel
map
feature map
double
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010506077.1A
Other languages
Chinese (zh)
Inventor
周东生
张运波
易鹏飞
杨鑫
张强
魏小鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University filed Critical Dalian University
Priority to CN202010506077.1A priority Critical patent/CN111680619A/en
Publication of CN111680619A publication Critical patent/CN111680619A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention provides a pedestrian detection method based on a convolutional neural network and a dual-attention mechanism, comprising the following steps: inputting images from the Caltech dataset and the CityPersons dataset; extracting image features with a convolutional neural network based on the dual-attention mechanism as the backbone network, while a detection head classifies and regresses the features; and marking the detected pedestrians with bounding boxes. The invention provides a lightweight dual-attention modeling method that not only models the relationships between feature channels but also improves the expressive power of the feature map at the pixel level. The invention constructs a single-stage pedestrian detector, CSANet, based on the dual-attention mechanism, and further analyzes the factors influencing its performance through experiments. CSANet achieves state-of-the-art performance on the Caltech benchmark and competitive performance on the CityPersons benchmark while maintaining computational efficiency.

Description

Pedestrian detection method based on convolutional neural network and dual-attention mechanism
Technical Field
The invention relates to the technical field of pedestrian detection in computer vision, in particular to a pedestrian detection method based on a convolutional neural network and an attention mechanism.
Background
Pedestrian detection plays a crucial role in computer vision tasks such as autonomous driving, robotics and surveillance. With the rise of deep learning, pedestrian detectors have made great progress in recent years. However, current state-of-the-art pedestrian detectors are still far from reaching a cognitive level as fast and accurate as humans. Mainstream pedestrian detectors tend to benefit directly from convolutional neural networks (CNNs) designed for image classification. Although CNNs use large downsampling factors to generate high-level semantic features, they cannot adaptively focus on the useful channels and regions of the feature map, which limits further improvement of pedestrian detection performance.
Notably, pedestrians in traffic scenes have characteristics different from general objects, such as diverse backgrounds and multiple scales. Typically, researchers employ deep models to abstract the high-level semantics of object instances, which helps identify pedestrians in traffic scenes. Unfortunately, this approach filters out the location information of many small-scale as well as large-scale pedestrians. Due to the inherent nature of CNNs, critical channels cannot be highlighted and critical spatial locations cannot be emphasized. Convolution is a local operation that obtains local information of an image by applying a convolution kernel to a local image patch; this local operation prevents the CNN from capturing the image from a global view. Therefore, designing an effective backbone for pedestrian detection remains a difficult task.
Disclosure of Invention
In accordance with the technical problem set forth above, a pedestrian detection method based on a convolutional neural network and an attention mechanism is provided. The invention mainly relates to a pedestrian detection method based on a convolutional neural network and a dual-attention mechanism, characterized by comprising the following steps:
step S1: input images from the Caltech [1] and CityPersons [2] datasets;
step S2: use a convolutional neural network based on the dual-attention mechanism as the backbone network to extract image features, and let the detection head classify and regress the features;
step S3: mark the detected pedestrians with bounding boxes.
Further, the mathematical modeling process of the convolutional neural network based on the dual-attention mechanism mainly comprises the following steps:
step S21: given the output of the residual block
F ∈ ℝ^{C×H×W}, defined as the original feature map; the CAM and the SAM are derived in turn to obtain a 1D channel attention map and a 2D spatial attention map; the original feature map is sequentially recalibrated by the two attention maps as:
F_C = M_C ⊕ F (9);
F_S = M_S ⊕ F_C (10);
where ⊕ denotes per-pixel addition, M_C denotes the channel attention map, M_S denotes the spatial attention map; F_C denotes the feature map calibrated by M_C, and F_S denotes the feature map calibrated by M_S;
step S22: compress each 2D feature map into a real number along the spatial axes of the original feature map; aggregate the original feature map F ∈ ℝ^{C×H×W} by a global pooling operation to obtain the channel attention descriptor U ∈ ℝ^{C×1×1}:
u_c = GAP(f_c) = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} f_c(i, j) (11);
where GAP denotes the global average pooling operation, f_c denotes the c-th channel of the original feature map F, and u_c denotes the c-th real number of the channel descriptor U;
step S23: feed U ∈ ℝ^{C×1×1} into two fully-connected layers, and obtain the final channel attention map M_C ∈ ℝ^{C×1×1} through a sigmoid activation function:
M_C = σ(W_2 δ(W_1 U)) (12);
where δ denotes the ReLU activation function, σ denotes the sigmoid activation, and W denotes the scaling parameters of the fully-connected layers, with W_1 ∈ ℝ^{(C/r)×C} and W_2 ∈ ℝ^{C×(C/r)}; r is the compression ratio, set to 16;
step S24: use M_C to recalibrate the input feature map F ∈ ℝ^{C×H×W}; first, M_C is broadcast to an attention map M'_C ∈ ℝ^{C×H×W} with the same dimensions as F, then M'_C is applied to the feature map F by a per-pixel addition operation to obtain the calibrated feature map F_C:
f'_c = F_add(m'_c, f_c) = m'_c + f_c (13);
where F_add denotes channel-wise addition, m'_c is the c-th channel of the feature map M'_C, and f_c is the c-th channel of the original feature map F; f'_c denotes the c-th channel of F_C, and m'_c carries the global information of the c-th channel of M'_C;
step S25: to compute an effective spatial attention map, compress the 3D feature map into a single 2D feature channel along the channel axis of the feature map; given the feature map F_C ∈ ℝ^{C×H×W} calibrated by the channel attention map, obtain the feature map V ∈ ℝ^{1×H×W} by an average pooling operation:
v_ij = AP(f_ij) = (1/C) Σ_{c=1}^{C} f'_c(i, j) (14);
where AP denotes the average pooling operation, f_ij is the pixel value at point (i, j) of f'_c, C is the number of feature channels, and v_ij is the pixel value at point (i, j) of the feature map V;
step S26: convolve the feature map V with a convolution layer with a kernel size of 7 × 7 and a stride of 1, and then obtain the spatial attention map M_S ∈ ℝ^{1×H×W} through a sigmoid activation function:
M_S = σ(f^{7×7}(V)) (15);
where σ denotes the sigmoid activation function and f^{7×7} denotes a convolution operation with a 7 × 7 kernel;
step S27: use M_S to recalibrate the feature map F_C; first, M_S is broadcast to M'_S ∈ ℝ^{C×H×W}, a feature map with the same dimensions as F_C; then:
f'_s = F_add(m'_s, f_s) = m'_s + f_s (16);
where F_add denotes spatial addition, m'_s is the s-th channel of M'_S, f_s is the s-th channel of F_C, and f'_s is the s-th channel of F_S.
Further, the dual attention modules are organized in a serial (sequential) manner, additive operations are used when broadcasting the attention maps, and the dual-attention mechanism is embedded into the convolutional layers of ResNet-50.
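For illustration only, the following minimal PyTorch-style sketch (not part of the disclosure; tensor shapes and names are assumptions) shows how a 1D channel attention map and a 2D spatial attention map are broadcast and added to a feature map in sequence, as in the two recalibration formulas above:

```python
import torch

# Illustrative shapes only: C channels, H x W spatial resolution.
C, H, W = 256, 32, 64
F = torch.randn(C, H, W)     # original feature map F
M_c = torch.rand(C, 1, 1)    # 1D channel attention map: one weight per channel
M_s = torch.rand(1, H, W)    # 2D spatial attention map: one weight per pixel

# Per-pixel additive broadcast, first channel then spatial recalibration:
F_c = F + M_c                # M_c is broadcast over H and W, then added
F_s = F_c + M_s              # M_s is broadcast over C, then added
print(F_s.shape)             # torch.Size([256, 32, 64])
```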
Compared with the prior art, the invention has the following advantages:
1. The dual-attention modeling method is lightweight; it not only models the relationships between feature channels but also improves the expressive power of the feature map at the pixel level.
2. A single-stage pedestrian detector, CSANet, based on the dual-attention mechanism is constructed, and the factors influencing its performance are further analyzed through experiments.
3. CSANet achieves state-of-the-art performance on the Caltech benchmark and competitive performance on the CityPersons benchmark while maintaining computational efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 shows the overall architecture of CSANet according to the present invention. It mainly comprises two parts, a backbone network module and a detection head. An example of a dual attention module embedded in ResNet-50 is shown in the dashed box; the module combines a channel attention module (CAM) and a spatial attention module (SAM) in sequence.
FIG. 2 is the network structure of the channel attention module and the spatial attention module of the present invention. Feature maps are annotated with their dimensions, e.g. H × W × C for height H, width W and C channels; ⊕ denotes the element-wise addition broadcast operation.
FIG. 3 shows the comparison of CSANet of the present invention with other advanced pedestrian detector models at the threshold IoU = 0.5, shown as miss rate versus FPPI curves.
FIG. 4 shows the comparison of CSANet of the present invention with other advanced pedestrian detector models at the threshold IoU = 0.75, shown as miss rate versus FPPI curves.
FIG. 5 is a model visualization of the present invention comparing a visualization using a dual attention model with a visualization without a dual attention model.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in the figures, the invention provides a pedestrian detection method based on a convolutional neural network and a dual-attention mechanism, characterized by comprising the following steps:
S1: input images from the Caltech [1] and CityPersons [2] datasets;
S2: use a convolutional neural network based on the dual-attention mechanism as the backbone network to extract image features, and let the detection head classify and regress the features;
S3: mark the detected pedestrians with bounding boxes.
The overall framework of CSANet is shown in FIG. 1; the backbone network is a ResNet-50 embedded with dual attention modules. The detection head module mainly comprises three convolution layers, which respectively predict the center position, scale and offset of pedestrians. ResNet-50 is divided into 5 stages, and the output feature maps of stages 2 to 5 are defined as P2, P3, P4 and P5, respectively, downsampled by factors of 4, 8, 16 and 16 with respect to the input. The low-level feature maps provide more accurate location information, while the deeper feature maps contain more semantic information. The multi-scale feature maps of the stages are simply concatenated to obtain a fused feature map; before fusion, a deconvolution operation is used to unify the resolution of the output feature maps of each stage. In general, shallow features are generic features, while the semantic information expressed by each channel of the deep features is category-specific.
Taking the third residual block of stage 5 as an example, the process by which the dual attention network broadcasts the attention maps is as follows:
Given the output of the residual block F ∈ ℝ^{C×H×W}, defined as the original feature map, the CAM and the SAM are derived in turn to obtain a 1D channel attention map and a 2D spatial attention map. The original feature map is sequentially recalibrated by the two attention maps. The computation of the two calibrated feature maps can be summarized as:
F_C = M_C ⊕ F (17)
F_S = M_S ⊕ F_C (18)
Here ⊕ denotes per-pixel addition, M_C is the channel attention map, and M_S is the spatial attention map. F_C is the feature map calibrated by M_C, and F_S is the feature map calibrated by M_S.
Fig. 2(a) shows the network structure of the channel attention module. To compute an effective channel attention map, each 2D feature map is compressed into a real number along its spatial axes. The original feature map F ∈ ℝ^{C×H×W} is first aggregated using a global average pooling operation to obtain the channel descriptor U ∈ ℝ^{C×1×1}. The whole calculation process is:
u_c = GAP(f_c) = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} f_c(i, j) (19)
Here, GAP denotes the global average pooling operation, f_c denotes the c-th channel of the original feature map F, and u_c denotes the c-th real number of the channel descriptor U.
U ∈ ℝ^{C×1×1} is then fed into two fully-connected layers, and the final channel attention map M_C ∈ ℝ^{C×1×1} is obtained through a sigmoid activation function. The whole calculation process is expressed as:
M_C = σ(W_2 δ(W_1 U)), (20)
Here, the two fully-connected layers are used to better fit the complex correlation between channels, δ denotes the ReLU activation function, σ denotes the sigmoid activation, and W denotes the scaling parameters of the fully-connected layers, with W_1 ∈ ℝ^{(C/r)×C} and W_2 ∈ ℝ^{C×(C/r)}; r is the compression ratio and is set to 16.
Finally, M_C is used to recalibrate the input feature map F ∈ ℝ^{C×H×W}. In the first step, M_C is broadcast to an attention map M'_C ∈ ℝ^{C×H×W} with the same dimensions as F; then M'_C is applied to the feature map F by a per-pixel addition operation to obtain the calibrated feature map F_C. The whole calculation process is:
f'_c = F_add(m'_c, f_c) = m'_c + f_c. (21)
Here, F_add denotes channel-wise addition, m'_c is the c-th channel of the feature map M'_C, and f_c is the c-th channel of the original feature map F. f'_c is the c-th channel of F_C, and m'_c carries the global information of the c-th channel of M'_C.
Fig. 2(b) shows the network structure of the spatial attention module. To compute an effective spatial attention map, the 3D feature map is compressed into a single 2D feature channel along the channel axis. Given the feature map F_C ∈ ℝ^{C×H×W} calibrated by the channel attention map, the feature map V ∈ ℝ^{1×H×W} is obtained using an average pooling operation. The calculation process is:
v_ij = AP(f_ij) = (1/C) Σ_{c=1}^{C} f'_c(i, j) (22)
Here, AP denotes the average pooling operation, f_ij is the pixel value at point (i, j) of f'_c, C is the number of feature channels, and v_ij is the pixel value at point (i, j) of the feature map V.
Then, the feature map V is convolved with a convolution layer with a kernel size of 7 × 7 and a stride of 1, and the spatial attention map M_S ∈ ℝ^{1×H×W} is obtained with a sigmoid activation function. The calculation process is:
M_S = σ(f^{7×7}(V)), (23)
where σ denotes the sigmoid activation function and f^{7×7} denotes a convolution operation with a 7 × 7 kernel.
Finally, M_S is used to recalibrate the feature map F_C. In the first step, M_S is broadcast to M'_S ∈ ℝ^{C×H×W}, a feature map with the same dimensions as F_C. The whole calculation process is:
f'_s = F_add(m'_s, f_s) = m'_s + f_s. (24)
Here, F_add denotes spatial addition, m'_s is the s-th channel of M'_S, f_s is the s-th channel of F_C, and f'_s is the s-th channel of F_S.
The spatial attention map integrates the global context information of the feature map. Moreover, it attends to the global information of each channel, enlarges the effective receptive field, and allows the CNN to capture image information from a global perspective.
The channel attention module and the spatial attention module may be embedded in the ResNet-50 in a parallel or serial fashion. The channel attention module focuses on important channels, and the spatial attention module focuses on important regions of the feature map. The correct combination of the two attention modules may maximize the effectiveness of the attention mechanism.
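Under the serial CAM-first arrangement found to work best in Table 4, the two modules can be chained as in the following hedged sketch, which reuses the ChannelAttention and SpatialAttention sketches above; the wiring is an assumption consistent with that setting:

```python
import torch.nn as nn

class DualAttentionBlock(nn.Module):
    """Sketch: apply channel attention, then spatial attention, to the output
    of a residual block (serial CAM + SAM arrangement)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.cam = ChannelAttention(channels, reduction)
        self.sam = SpatialAttention(kernel_size=7)

    def forward(self, f):
        f_c = self.cam(f)     # recalibrate important channels first
        f_s = self.sam(f_c)   # then recalibrate important spatial locations
        return f_s
```

In this sketch the block simply wraps the output of a chosen residual block, e.g. `out = DualAttentionBlock(2048)(residual_output)` for the last stage of ResNet-50.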
Table 1 discusses how the fused feature map is generated in the ablation experiments. The Extraction methods column gives the extraction method of each feature map; e.g., stage2-5 indicates that the dual attention network is embedded in stages 2, 3, 4 and 5 of ResNet-50 [3]. The Channel column gives the number of channels of the backbone output feature map in the corresponding model. Parameters gives the number of parameters of the corresponding CSANet model. Test time gives the time required to test one image. Bold font indicates the best result in the corresponding column.
TABLE 1
(Table 1 appears as an image in the original document and cannot be reproduced here.)
Table 2 discusses the feature map fusion method in the ablation experiments. In the Feature maps column, P2, P3, P4 and P5 respectively denote the output feature maps of stages 2, 3, 4 and 5 of ResNet-50. The Channel column gives the number of channels of the backbone output feature map in the corresponding model. Parameters gives the number of parameters of the corresponding CSANet model. Test time gives the time required to test one image. Bold font indicates the best result in the corresponding column.
TABLE 2
(Table 2 appears as an image in the original document and cannot be reproduced here.)
Table 3 discusses additive versus multiplicative broadcasting in the ablation experiments. The Description column identifies the corresponding model. For example, p3p4p5+add denotes fusing the feature maps P3, P4 and P5 and broadcasting the attention maps by addition; p3p4p5+multiply denotes fusing the feature maps P3, P4 and P5 and broadcasting the attention maps by multiplication. The Channel column gives the number of channels of the backbone output feature map in the corresponding model. Parameters gives the number of parameters of the corresponding CSANet model. Test time gives the time required to test one image. Bold font indicates the best result in the corresponding column.
TABLE 3
(Table 3 appears as an image in the original document and cannot be reproduced here.)
Table 4 discusses how the channel and spatial attention modules are connected in the ablation experiments. The Description column identifies the corresponding model. CAM+SAM means that the channel attention module (CAM) is followed by the spatial attention module (SAM) in series. CAM//SAM means that CAM and SAM are connected in parallel. SAM+CAM means that the spatial attention module (SAM) is followed by the channel attention module (CAM) in series. The Channel column gives the number of channels of the backbone output feature map in the corresponding model. Parameters gives the number of parameters of the corresponding CSANet model. Test time gives the time required to test one image. Bold font indicates the best result in the corresponding column.
TABLE 4
(Table 4 appears as an image in the original document and cannot be reproduced here.)
Table 5 compares state-of-the-art detectors on CityPersons at IoU = 0.5. The Hardware column gives the GPU used for network training, while the Scale column represents the number of GPUs. Bold numbers indicate the best results.
TABLE 5
(Table 5 appears as an image in the original document and cannot be reproduced here.)
Example:
results and analysis of the experiments
(1) Ablation experiment
As shown in Table 1, the stage3-5 embedding method achieves the best performance of 3.88% MR^-2 at IoU = 0.5. At IoU = 0.75, the stage2-5 embedding method improves performance by about 36% compared with the stage5 embedding method. Notably, at IoU = 0.5, stage2-4 and stage5 have comparable performance, 4.28% MR^-2 and 4.27% MR^-2 respectively. However, at the threshold of IoU = 0.75, the performance gap between them is large, at 4.77% MR^-2. This comparison shows that the dual attention network is more favorable for regressing high-quality bounding boxes.
As can be seen from Table 2, the models that fuse only low-level features have poor accuracy, but fewer parameters and a higher detection speed. As the number of fused feature maps increases, the detection accuracy of the model improves. At IoU = 0.5, the model that fuses only the two shallower feature maps is the least accurate, and the fully fused model improves on it significantly, by about 47%. At IoU = 0.75, the fusion of P3, P4 and P5 gives the best result. In general, deeper features are helpful for pedestrian detection, but they occupy more memory.
As can be seen from Table 3, the models using additive broadcasting are better than those using multiplicative broadcasting. The MR^-2 difference between p3p4p5+add and p3p4p5+multiply is about 37%; in the second and third groups of experiments, the difference is about 17%. Furthermore, the broadcast method of the attention maps hardly affects the test time; in practice, the runtime is mainly determined by the model parameters.
In fact, multiplication is computationally more complex than addition. Although multiplication enhances the useful information in the feature map, it also over-amplifies the effect of noise, and the multiplicative weighting operation unduly suppresses some contextual details, which is detrimental to locating pedestrians. Besides accuracy, the pedestrian detection task also needs to consider the real-time performance of the model, and multiplication increases the running time of the network to some extent.
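The two broadcast variants compared in Table 3 differ only in the broadcast operator, as the following tiny sketch illustrates (tensor shapes and values are arbitrary assumptions):

```python
import torch

f = torch.randn(1, 256, 32, 64)     # feature map
m_c = torch.rand(1, 256, 1, 1)      # channel attention map with values in [0, 1]

f_add = f + m_c   # additive broadcast: shifts activations, preserves context details
f_mul = f * m_c   # multiplicative broadcast: rescales activations, which can suppress
                  # low-weighted context and amplify noise on highly weighted channels
```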
As shown in Table 4, CAM+SAM indicates that the channel attention module and the spatial attention module are connected sequentially. The analysis shows that the sequential arrangement yields better results than the parallel arrangement. The CAM-first mode gives the best result, 3.88% MR^-2. CAM//SAM indicates that the two modules are arranged in parallel, with performance 0.27% MR^-2 lower than CAM+SAM. In the third mode, SAM+CAM, the SAM comes first in the dual attention network, and this model has the worst performance, 4.57% MR^-2.
(2) Visualization experiment
As shown in Fig. 5, the experimental results of the model are visualized. The visualization algorithm Grad-CAM [4] is used to qualitatively interpret the CSANet model, which improves the interpretability of the CNN to a certain degree. The algorithm derives a class activation map that can be used to locate the regions of a class in the image. Grad-CAM mainly uses the gradients of the last convolutional layer of the network to generate a heatmap that highlights the important pixels in the input image.
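A minimal, self-contained sketch of the Grad-CAM idea (the hook usage, layer choice and normalization are assumptions; published Grad-CAM implementations differ in detail):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, last_conv, image, target_index):
    """Sketch: weight the last conv layer's activations by the spatial mean of
    their gradients w.r.t. the target score, then ReLU and upsample to a heatmap.
    `image` is assumed to be an N x C x H x W tensor and `model(image)` a score tensor."""
    acts, grads = {}, {}
    h1 = last_conv.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = last_conv.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    score = model(image).flatten()[target_index]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)        # GAP of gradients per channel
    cam = torch.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-8)                            # normalized heatmap
```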
As can be seen from Fig. 5, the heatmap of the model with the dual attention network covers the pedestrian regions better than that of the model without the dual attention network. In other words, the dual attention network focuses better on the pixel information of the target region. The visualization results qualitatively show that the pixel expression capability of the target region is enhanced to a certain extent in the improved feature maps.
(3) Comparative experiments on SOTA
Fig. 3 (IoU = 0.5) and Fig. 4 (IoU = 0.75) compare the performance of advanced pedestrian detection algorithms on the Caltech dataset. The algorithms include DeepParts [5], MS-CNN [6], SA-FasterRCNN [7], RPN+BF [8], FasterRCNN+ATT [9], SDS-RCNN [10], AdaptFasterRCNN [2], CSP+City [11], CSANet [ours] and CSANet+City [ours].
As shown in Fig. 3, the model initialized from the CityPersons dataset has the best performance: compared with the current high-performance method CSP+City, CSANet+City achieves the best result of 3.55% MR^-2. The CSANet model initialized from the ImageNet dataset achieves 3.88% MR^-2, a significant improvement over the baseline model. As shown in Fig. 4, the CSANet model also achieves a lower miss rate under the tighter threshold setting, which means that the dual attention network also helps improve the quality of the bounding boxes.
Table 5 compares advanced pedestrian detectors on the CityPersons dataset. This set of experiments uses IoU = 0.5, and only a single NVIDIA GTX 1080Ti GPU with mini-batch = 2 was used to train the network. Table 5 shows that the CSANet detector achieves a state-of-the-art 7.25% MR^-2 on the Bare subset of CityPersons. On the Reasonable subset, CSANet is second only to the CSP model [11] trained with mini-batch = 8. The CSANet detector is comparable to ALFNet [12], and improves by about 2.6% over the competing RepLoss [13]. In fact, a larger batch size, within reasonable limits, makes the gradient descent direction more accurate.
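The MR^-2 values quoted above are log-average miss rates; the following hedged sketch shows how such a value is typically computed over the FPPI range [10^-2, 10^0] following Dollar et al. [1] (the sampling and interpolation details are assumptions):

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Sketch: geometric mean of the miss rate sampled at num_points log-spaced
    FPPI values in [1e-2, 1e0] (the MR^-2 metric). Assumes `fppi` is sorted in
    ascending order with `miss_rate` giving the matching miss-rate curve."""
    refs = np.logspace(-2.0, 0.0, num_points)
    sampled = []
    for r in refs:
        idx = np.where(fppi <= r)[0]
        # Miss rate at the largest FPPI not exceeding the reference point,
        # falling back to the worst miss rate when the curve starts above it.
        sampled.append(miss_rate[idx[-1]] if idx.size else miss_rate.max())
    return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))
```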
References
[1] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," IEEE Trans. Pattern Anal. Mach. Intell. (PAMI), vol. 34, no. 4, pp. 743-761, Apr. 2012.
[2] S. Zhang, R. Benenson, and B. Schiele, "Citypersons: A diverse dataset for pedestrian detection," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3213-3221.
[3] K. He, X. Zhang, S. Ren, et al., "Deep Residual Learning for Image Recognition," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), June 2016, pp. 770-778.
[4] R. R. Selvaraju, M. Cogswell, et al., "Grad-cam: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 618-626.
[5] Y. Tian, P. Luo, X. Wang, et al., "Deep learning strong parts for pedestrian detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1904-1912.
[6] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, "A unified multi-scale deep convolutional neural network for fast object detection," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 354-370.
[7] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan, "Scale-aware fast R-CNN for pedestrian detection," IEEE Trans. Multimedia, vol. 20, no. 4, pp. 985-996, Apr. 2017.
[8] L. Zhang, L. Lin, X. Liang, and K. He, "Is faster R-CNN doing well for pedestrian detection?," in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2016, pp. 443-457.
[9] S. Zhang, J. Yang, and B. Schiele, "Occluded pedestrian detection through guided attention in CNNs," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 6995-7003.
[10] G. Brazil, X. Yin, X. Liu, "Illuminating pedestrians via simultaneous detection & segmentation," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 4950-4959.
[11] W. Liu, S. Liao, W. Ren, W. Hu, and Y. Yu, "High-level Semantic Feature Detection: A New Perspective for Pedestrian Detection," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5187-5196.
[12] W. Liu, S. Liao, W. Hu, X. Liang, and X. Chen, "Learning efficient single-stage pedestrian detectors by asymptotic localization fitting," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 618-634.
[13] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen, "Repulsion loss: Detecting pedestrians in a crowd," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7774-7783.
[14] T. Song, L. Sun, D. Xie, H. Sun, and S. Pu, "Small-scale pedestrian detection based on topological line localization and temporal feature aggregation," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 536-551.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (3)

1. The pedestrian detection method based on the convolutional neural network and the dual-attention mechanism, characterized by comprising the following steps:
S1: inputting images from the Caltech dataset and the CityPersons dataset;
S2: using a convolutional neural network based on the dual-attention mechanism as the backbone network to extract image features, with a detection head classifying and regressing the features;
S3: marking the detected pedestrians with bounding boxes.
2. The convolutional neural network and dual-attention mechanism-based pedestrian detection method of claim 1, wherein:
the mathematical modeling process of the convolutional neural network based on the dual-attention mechanism mainly comprises the following steps:
s21: given the output of the residual block
F ∈ ℝ^{C×H×W}, defined as the original feature map; the CAM and the SAM are derived in turn to obtain a 1D channel attention map and a 2D spatial attention map; the original feature map is sequentially recalibrated by the two attention maps as:
F_C = M_C ⊕ F (1);
F_S = M_S ⊕ F_C (2);
wherein ⊕ denotes per-pixel addition, M_C denotes the channel attention map, M_S denotes the spatial attention map; F_C denotes the feature map calibrated by M_C, and F_S denotes the feature map calibrated by M_S;
S22: compressing each 2D feature map into a real number along the spatial axes of the original feature map; aggregating the original feature map F ∈ ℝ^{C×H×W} by a global pooling operation to obtain the channel attention descriptor U ∈ ℝ^{C×1×1}:
u_c = GAP(f_c) = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} f_c(i, j) (3);
wherein GAP denotes the global average pooling operation, f_c denotes the c-th channel of the original feature map F, and u_c denotes the c-th real number of the channel descriptor U;
S23: feeding U ∈ ℝ^{C×1×1} into two fully-connected layers, and obtaining the final channel attention map M_C ∈ ℝ^{C×1×1} through a sigmoid activation function:
M_C = σ(W_2 δ(W_1 U)) (4);
wherein δ denotes the ReLU activation function, σ denotes the sigmoid activation, and W denotes the scaling parameters of the fully-connected layers, with W_1 ∈ ℝ^{(C/r)×C} and W_2 ∈ ℝ^{C×(C/r)}; r is the compression ratio, set to 16;
S24: using M_C to recalibrate the input feature map F ∈ ℝ^{C×H×W}; firstly, M_C is broadcast to an attention map M'_C ∈ ℝ^{C×H×W} with the same dimensions as F, and then M'_C is applied to the feature map F by a per-pixel addition operation to obtain the calibrated feature map F_C:
f'_c = F_add(m'_c, f_c) = m'_c + f_c (5);
wherein F_add denotes channel-wise addition, m'_c is the c-th channel of the feature map M'_C, and f_c is the c-th channel of the original feature map F; f'_c denotes the c-th channel of F_C, and m'_c carries the global information of the c-th channel of M'_C;
S25: in order to compute an effective spatial attention map, compressing the 3D feature map into a single 2D feature channel along the channel axis of the feature map; given the feature map F_C ∈ ℝ^{C×H×W} calibrated by the channel attention map, obtaining the feature map V ∈ ℝ^{1×H×W} by an average pooling operation:
v_ij = AP(f_ij) = (1/C) Σ_{c=1}^{C} f'_c(i, j) (6);
wherein AP denotes the average pooling operation, f_ij is the pixel value at point (i, j) of f'_c, C is the number of feature channels, and v_ij is the pixel value at point (i, j) of the feature map V;
S26: convolving the feature map V with a convolution layer with a kernel size of 7 × 7 and a stride of 1, and then obtaining the spatial attention map M_S ∈ ℝ^{1×H×W} through a sigmoid activation function:
M_S = σ(f^{7×7}(V)) (7);
wherein σ denotes the sigmoid activation function and f^{7×7} denotes a convolution operation with a 7 × 7 kernel;
S27: using M_S to recalibrate the feature map F_C; firstly, M_S is broadcast to M'_S ∈ ℝ^{C×H×W}, a feature map with the same dimensions as F_C; then:
f'_s = F_add(m'_s, f_s) = m'_s + f_s (8);
wherein F_add denotes spatial addition, m'_s is the s-th channel of M'_S, f_s is the s-th channel of F_C, and f'_s is the s-th channel of F_S.
3. The convolutional neural network and dual-attention mechanism-based pedestrian detection method of claim 1, further characterized in that:
the dual attention modules are organized in a serial (sequential) manner, additive operations are used when broadcasting the attention maps, and the dual-attention mechanism is embedded into multiple convolutional layers of ResNet-50.
CN202010506077.1A 2020-06-05 2020-06-05 Pedestrian detection method based on convolutional neural network and double-attention machine mechanism Pending CN111680619A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010506077.1A CN111680619A (en) 2020-06-05 2020-06-05 Pedestrian detection method based on convolutional neural network and double-attention machine mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010506077.1A CN111680619A (en) 2020-06-05 2020-06-05 Pedestrian detection method based on convolutional neural network and double-attention machine mechanism

Publications (1)

Publication Number Publication Date
CN111680619A true CN111680619A (en) 2020-09-18

Family

ID=72434993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010506077.1A Pending CN111680619A (en) 2020-06-05 2020-06-05 Pedestrian detection method based on convolutional neural network and double-attention machine mechanism

Country Status (1)

Country Link
CN (1) CN111680619A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135243A (en) * 2019-04-02 2019-08-16 上海交通大学 A kind of pedestrian detection method and system based on two-stage attention mechanism
CN110675406A (en) * 2019-09-16 2020-01-10 南京信息工程大学 CT image kidney segmentation algorithm based on residual double-attention depth network
CN110991362A (en) * 2019-12-06 2020-04-10 西安电子科技大学 Pedestrian detection model based on attention mechanism
CN111160628A (en) * 2019-12-13 2020-05-15 重庆邮电大学 Air pollutant concentration prediction method based on CNN and double-attention seq2seq

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUNBO ZHANG et al.: "CSANet: Channel and Spatial Mixed Attention CNN for Pedestrian Detection" *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560720A (en) * 2020-12-21 2021-03-26 奥比中光科技集团股份有限公司 Pedestrian identification method and system
CN112800964A (en) * 2021-01-27 2021-05-14 中国人民解放军战略支援部队信息工程大学 Remote sensing image target detection method and system based on multi-module fusion
CN112800964B (en) * 2021-01-27 2021-10-22 中国人民解放军战略支援部队信息工程大学 Remote sensing image target detection method and system based on multi-module fusion
CN113450366A (en) * 2021-07-16 2021-09-28 桂林电子科技大学 AdaptGAN-based low-illumination semantic segmentation method
CN113450366B (en) * 2021-07-16 2022-08-30 桂林电子科技大学 AdaptGAN-based low-illumination semantic segmentation method

Similar Documents

Publication Publication Date Title
Pang et al. Hierarchical dynamic filtering network for RGB-D salient object detection
Jiang et al. Crowd counting and density estimation by trellis encoder-decoder networks
Giraldo et al. Graph moving object segmentation
Tran et al. Deep end2end voxel2voxel prediction
Liao et al. Video-based person re-identification via 3d convolutional networks and non-local attention
CN110188239B (en) Double-current video classification method and device based on cross-mode attention mechanism
Liu et al. Del: Deep embedding learning for efficient image segmentation.
CN111680619A (en) Pedestrian detection method based on convolutional neural network and double-attention machine mechanism
Fang et al. Deep3DSaliency: Deep stereoscopic video saliency detection model by 3D convolutional networks
CN113158723A (en) End-to-end video motion detection positioning system
Duta et al. Histograms of motion gradients for real-time video classification
CN111797841B (en) Visual saliency detection method based on depth residual error network
Chen et al. Dr-tanet: Dynamic receptive temporal attention network for street scene change detection
Xie et al. Context-aware pedestrian detection especially for small-sized instances with Deconvolution Integrated Faster RCNN (DIF R-CNN)
CN111931603A (en) Human body action recognition system and method based on double-current convolution network of competitive combination network
Xia et al. Pedestrian detection algorithm based on multi-scale feature extraction and attention feature fusion
Hou et al. A super-fast deep network for moving object detection
Guo et al. Learning efficient stereo matching network with depth discontinuity aware super-resolution
Wen et al. Deep fusion based video saliency detection
Li et al. Msffa: a multi-scale feature fusion and attention mechanism network for crowd counting
Islam et al. Representation for action recognition with motion vector termed as: SDQIO
Liu et al. Hypergraph attentional convolutional neural network for salient object detection
Geng et al. Adaptive multi-level graph convolution with contrastive learning for skeleton-based action recognition
Basavaiah et al. Robust Feature Extraction and Classification Based Automated Human Action Recognition System for Multiple Datasets.
Byvshev et al. Are 3D convolutional networks inherently biased towards appearance?

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200918