CN115424104A - Target detection method based on feature fusion and attention mechanism - Google Patents

Target detection method based on feature fusion and attention mechanism

Info

Publication number
CN115424104A
Authority
CN
China
Prior art keywords: feature, layer, module, detection, fusion
Legal status
Pending
Application number
CN202210998016.0A
Other languages
Chinese (zh)
Inventor
周慧鑫
张伟鹏
戴加乐
于跃
宋江鲁奇
张嘉嘉
李苗青
张喆
滕翔
王珂
李欢
朱贺隆
梅峻溪
王财顺
石志鹏
Current Assignee
Xidian University
Original Assignee
Xidian University
Application filed by Xidian University
Priority to CN202210998016.0A
Publication of CN115424104A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection


Abstract

An infrared image is input into a MobileNet network for layer-by-layer convolution to obtain feature maps of different scales. A bidirectional feature fusion module, IBFPN, is established, and the feature pyramid maps used for detection are input into the IBFPN so that upper-layer and lower-layer information are fused with each other. After the IBFPN, each fused feature layer is input into an attention module, ECBAM, which assigns different weights to different features. The feature maps of different scales processed by the IBFPN and ECBAM are sent to a detection module for detection, and the category and corresponding bounding box of each candidate box are obtained as the prediction result. Non-maximum suppression is then applied to the prediction result, and redundant target boxes are deleted to obtain the final detection result. A comprehensive comparison with various detection algorithms shows experimentally that the method can effectively improve the detection precision of infrared targets.

Description

Target detection method based on feature fusion and attention mechanism
Technical Field
The invention belongs to the field of infrared target detection, and particularly relates to a target detection method based on feature fusion and an attention mechanism.
Background
Target detection has long been a research focus as an important branch of image processing and computer vision. Target detection, also called target extraction, essentially classifies and localizes multiple objects in a given image. Unlike visible-light imaging, infrared imaging is a passive imaging technology that forms images by detecting the infrared thermal radiation emitted by objects, and it offers long detection range, strong penetration and round-the-clock operation. An infrared target detection technology that combines target detection with an infrared imaging system can therefore effectively make up for the shortcomings of visible-light detection and meet the requirements of different industries.
In recent years, deep learning algorithms have performed excellently in the field of image processing. Unlike traditional hand-crafted features, deep learning extracts features automatically through a convolutional neural network and has good robustness and portability, so infrared target detection algorithms based on deep learning have important research value. The SSD detection algorithm uses a backbone network plus additional convolution layers to generate six groups of feature maps of different scales from which candidate-box categories and coordinates are predicted. Feature maps produced at different convolution depths contain different image information: shallow feature maps contain texture and position information of the image, but the extracted features are not comprehensive; deep feature maps contain richer semantic information but lose much detail during convolution. The SSD feeds each branch into the detection module independently, so the features of the different layers are neither connected nor complementary. In addition, target information in infrared images is weak and the feature structure is sparse, so missed detections and false detections easily occur in complex background environments.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a target detection method based on feature fusion and an attention mechanism, which improves the traditional SSD detection algorithm, strengthens the expressive power of the detection branches, and suppresses the influence of irrelevant information, thereby alleviating the missed detections and false detections caused by complex backgrounds.
In order to achieve the above aim, the invention adopts the following technical solution:
a target detection method based on feature fusion and attention mechanism comprises the following steps:
step 1, inputting the infrared image into a MobileNet network to perform layer-by-layer convolution calculation to obtain feature maps of different scales, wherein the 6 feature maps DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 are the feature pyramid images used for detection;
step 2, establishing a bidirectional feature fusion module IBFPN, and inputting the feature pyramid image for detection into the IBFPN to perform mutual fusion of upper and lower layer information;
step 3, after the IBFPN, inputting each fused feature layer into an attention module ECBAM, and assigning different weights to different features through the ECBAM;
step 4, sending the DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 feature maps processed by the IBFPN and ECBAM into a detection module for detection, and obtaining the category and corresponding bounding box of each candidate box to obtain a prediction result;
and 5, performing non-maximum suppression on the prediction result, and deleting redundant target boxes to obtain a final detection result.
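For illustration only, the five steps above can be composed as in the following PyTorch-style sketch. The module objects backbone, ibfpn, ecbams and head, as well as the score and IoU thresholds, are placeholders standing in for the components described in steps 1 to 4; this is a rough outline under those assumptions, not the reference implementation.

import torch
import torchvision

def detect(image, backbone, ibfpn, ecbams, head, score_thr=0.5, iou_thr=0.45):
    # Step 1: layer-by-layer convolution gives six multi-scale feature maps
    feats = backbone(image)                               # [C1, ..., C6]
    # Step 2: bidirectional fusion of upper- and lower-layer information
    feats = ibfpn(feats)                                  # [C1_out, ..., C6_out]
    # Step 3: re-weight each fused layer with its ECBAM attention module
    feats = [att(f) for att, f in zip(ecbams, feats)]
    # Step 4: predict a category score and a bounding box for every candidate box
    boxes, scores, labels = head(feats)                   # (N, 4), (N,), (N,)
    # Step 5: drop low-confidence boxes, then class-wise non-maximum suppression
    keep = scores > score_thr
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = torchvision.ops.batched_nms(boxes, scores, labels, iou_thr)
    return boxes[keep], labels[keep], scores[keep]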
In an embodiment, in step 2, instead of detecting each feature map directly as in the conventional SSD algorithm, each feature map used for detection is input into the bidirectional feature fusion module IBFPN before detection, so that upper-layer and lower-layer information are fused with each other. The IBFPN is based on the traditional bidirectional pyramid network: a residual feature enhancement module RFA is constructed to enhance the top-level features, a bottom-up fusion path is introduced, and a connection from the initial input to the output is added for the same level.
In an embodiment, in step 2, the step of fusing the upper and lower layer information with each other is as follows:
step 2.1, in the forward propagation process, the pyramid feature levels generated by the traditional SSD target detection model are {C1, C2, C3, C4, C5, C6}, corresponding respectively to DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 in step 1;
step 2.2, inputting the C6-layer features into the RFA residual feature enhancement module to improve the feature representation, and obtaining an output feature layer R6 with multi-scale context information, namely RFA(C6);
step 2.3, adding and fusing R6 and the C6 subjected to 1×1 convolution dimensionality reduction to obtain the top-layer feature C6_td;
step 2.4, performing 1×1 convolution dimensionality reduction on C1 to C5 to obtain features C5_in to C1_in;
step 2.5, fusing C1_in to C6_in from top to bottom by using the following formulas:
C6_td = Conv(C6) + RFA(C6)
C5_td = Conv(C5) + Resize(C6_td)
C4_td = Conv(C4) + Resize(C5_td)
C3_td = Conv(C3) + Resize(C4_td)
C2_td = Conv(C2) + Resize(C3_td)
C1_td = Conv(C1) + Resize(C2_td)
wherein Conv is a 1×1 convolution operation for reducing the dimension of the feature channels, Conv(C6) to Conv(C1) are C6_in to C1_in, RFA(·) is the residual feature enhancement performed on a feature map, Resize is the upsampling operation adopted to match the resolutions of different layers, and "+" represents element-wise addition at corresponding positions; C1_td to C6_td are the features obtained after addition and fusion;
step 2.6, fusing the low-level features into the high levels from bottom to top, so that each level has not only strong high-level semantic information but also strong low-level detail and localization information, and adding a connection from the initial input to the output of the same level so that more features are fused without increasing the number of parameters, to obtain the fused feature layers C1_out to C6_out, wherein the fusion formulas are:
C1_out = C1_td = C1_in + Resize(C2_td)
C2_out = C2_td + Resize(C1_out) + C2_in
C3_out = C3_td + Resize(C2_out) + C3_in
C4_out = C4_td + Resize(C3_out) + C4_in
C5_out = C5_td + Resize(C4_out) + C5_in
C6_out = C6_td + Resize(C5_out) + C6_in
C1_out to C6_out are the multi-scale fused feature layers output after C1 to C6 pass through the IBFPN.
In one embodiment, the residual feature enhancement module RFA combines the idea of residual feature augmentation from AugFPN with the conventional Res residual structure and incorporates an Adaptive Spatial Fusion module (ASF). In both the FPN and the bidirectional feature pyramid, the highest-level features are upsampled and propagated from top to bottom, being gradually fused with the lower-level features. In this process the lower-level features are enhanced by semantic information from the higher levels, and the fused features naturally contain different context information; however, because of the resizing, the channel dimension of the top-level features is compressed, which causes part of the information to be lost. For this situation, the RFA designed by the invention adds context information through residual features to reduce the loss of the highest-level features and improve the performance of the pyramid.
Illustratively, the upsampling operation is a bilinear interpolation.
In one embodiment, the attention module ECBAM is obtained by improving the Convolutional Block Attention Module (CBAM). The module mainly comprises a channel attention module and a spatial attention module. The channel attention module focuses on the importance of different feature channels and comprises an average pooling layer AvgPool and a maximum pooling layer MaxPool, a one-dimensional convolution module and a Sigmoid mapping module. The spatial attention module focuses on the importance of features at different spatial positions, i.e. on "where" the informative parts are, and is complementary to the channel attention; it comprises an average pooling layer AvgPool and a maximum pooling layer MaxPool, a stacking module and a Sigmoid mapping module.
In one embodiment, in step 3, different weights are assigned to different features as follows:
step 3.1, inputting each fused feature layer from step 2 (the outputs C1_out to C6_out) into the ECBAM respectively; an input fused feature layer is denoted as I ∈ R^(H×W×C), wherein H and W are the height and width of the fused feature layer and C is the number of its channels; I is first compressed by average pooling and maximum pooling respectively to obtain two feature maps I_avg ∈ R^(1×1×C) and I_max ∈ R^(1×1×C);
step 3.2, the ECBAM considers the interaction between each channel and its k neighbouring channels, and applies a one-dimensional convolution of size k to I_avg and I_max respectively;
step 3.3, adding the two results of step 3.2 element-wise and mapping each channel feature into (0,1) through a Sigmoid activation function to obtain the weight coefficients M_c(I) of the different channels, calculated as:
M_c(I) = σ(C1D_k(I_avg) + C1D_k(I_max))
wherein σ is the Sigmoid activation function and C1D_k is a one-dimensional convolution of size k, the size of k being adaptively determined by:
k = | log2(C)/γ + b/γ |_odd
wherein C represents the number of channels of the fused feature layer I, |·|_odd represents the odd number nearest to the result, γ = 2, b = 1;
step 3.4, multiplying M_c(I) with the fused feature layer I to obtain the channel attention feature map I', i.e. I' = M_c(I) ⊗ I, wherein ⊗ denotes element-wise multiplication;
step 3.5, inputting I' as a new input feature map into the spatial attention module, and performing average pooling and maximum pooling across all channels at each position of I' to obtain I'_avg ∈ R^(H×W×1) and I'_max ∈ R^(H×W×1);
step 3.6, stacking I'_avg and I'_max and then performing a standard convolution;
step 3.7, obtaining the spatial weight M_s(I') of I' by using a Sigmoid activation function, calculated as:
M_s(I') = σ(f^(7×7)([AvgPool(I'); MaxPool(I')])) = σ(f^(7×7)([I'_avg; I'_max]))
wherein f^(7×7) is a convolution kernel of size 7×7.
Step 3.8, multiplying the obtained spatial weight M_s(I') with the channel attention feature map I' to obtain the final feature map that is sent for detection; the feature maps at this point are the DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 feature maps after IBFPN and ECBAM processing.
Compared with the prior art, the invention has the following beneficial effects:
the SSD detection algorithm based on the feature fusion and attention mechanism designed by the invention surpasses other algorithms in precision. Compared with the original SSD _ VGG16 algorithm, the detection precision and the algorithm efficiency of the algorithm are respectively improved by 3.04% and 0.9FPS; compared with the SSD _ MobileNet algorithm, the detection precision of the algorithm is improved by 5.15 percent; the precision of the DSSD algorithm is close to that of the algorithm, but the detection speed of the DSSD algorithm is lower than that of the algorithm.
Drawings
Fig. 1 is a schematic diagram of a feature pyramid network FPN and a local structure.
Fig. 2 is a schematic diagram of a residual feature enhancement module.
Fig. 3 is a schematic structural diagram of an IBFPN module with bidirectional feature pyramid.
Fig. 4 is a schematic diagram of a convolution attention module ECBAM structure.
Fig. 5 is a schematic structural diagram of a channel attention module.
Fig. 6 is a schematic structural diagram of a spatial attention module.
Fig. 7 shows the detection results of the present invention for vehicles and pedestrians.
Fig. 8 shows the detection results of the present invention under conditions where vehicles are occluded by each other and by the environment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The method mainly studies targets in infrared scenes, and the infrared dataset FLIR is used to analyse and verify the algorithm in order to demonstrate its effectiveness. All experiments use the Ubuntu 16.04 operating system with an NVIDIA GTX 1050Ti graphics card. The algorithm is written in Python 3.6.6, CUDA 9.0.136 is used as the parallel computing architecture, and the model is constructed, trained and tested with the PyTorch framework. The overall training process of the detection model is divided into two stages: a freezing stage and a thawing stage. In the freezing stage the backbone of the model is frozen, so the feature extraction network does not change; the batch size (batch_size) is 16, the learning rate is 0.0005, and 50 epochs are trained. In the thawing stage the whole model is trained and all parameters can change; the batch size is 8, the learning rate is 0.0001, and 100 epochs are trained in total.
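A minimal sketch of this two-stage schedule is given below, assuming the backbone is exposed as model.backbone and that a standard Adam optimizer is used; the optimizer choice, the criterion and the data-loading names are assumptions for illustration, while the batch sizes, learning rates and epoch counts follow the text.

import torch
from torch.utils.data import DataLoader

def train_two_stage(model, train_set, criterion, device="cuda"):
    model.to(device)

    # Freezing stage: backbone parameters are not updated (50 epochs, batch 16, lr 5e-4)
    for p in model.backbone.parameters():
        p.requires_grad = False
    loader = DataLoader(train_set, batch_size=16, shuffle=True)
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=5e-4)
    for epoch in range(50):
        for images, targets in loader:
            loss = criterion(model(images.to(device)), targets)
            opt.zero_grad(); loss.backward(); opt.step()

    # Thawing stage: the whole model is trained (100 epochs, batch 8, lr 1e-4)
    for p in model.parameters():
        p.requires_grad = True
    loader = DataLoader(train_set, batch_size=8, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(100):
        for images, targets in loader:
            loss = criterion(model(images.to(device)), targets)
            opt.zero_grad(); loss.backward(); opt.step()

The basic flow of the detection method is as follows: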
Step 1, inputting an infrared image into the traditional lightweight network MobileNet for layer-by-layer convolution calculation to obtain feature maps of different scales, wherein the 6 feature maps DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 are the feature pyramid images used for detection;
and 2, the SSD inputs the image into a feature extraction network in a pyramid feature level detection mode and carries out forward propagation to obtain feature maps with different scales, and then, the feature maps are respectively predicted. The method naturally generates the features with different scales through convolution without extra calculation, but each prediction only utilizes the features of one layer, and no information exchange and supplement between the features of different layers are realized. Specifically, the depth of the convolution layer in the SSD detection algorithm is different, and the image information included in the feature map is also different. The shallow feature map contains information such as texture and position of the image, but the extracted features are not comprehensive; the deep feature map contains richer semantic information, but loses much detail in the convolution process. The traditional SSD detection algorithm only inputs each branch information into the detection module independently, and the characteristics of each layer are not related and supplemented. The shallow feature can be strengthened layer by adding an FPN structure in the SSD algorithm, but the position information contained in the shallow feature is not effectively transmitted to the deep feature.
Therefore, on the basis of the original pyramid feature levels, the invention establishes the bidirectional feature fusion module IBFPN, which can effectively transmit shallow information. The feature pyramid images used for detection are input into the IBFPN so that upper-layer and lower-layer information are fused with each other, thereby improving the detection performance for the target.
Step 2.1: in the FPN, the top-level features are continuously upsampled and fused with the features of the other layers to obtain strengthened feature maps; these feature maps form a new feature pyramid, which is input into the detection module to predict the category and position of the target. As shown in fig. 1, the FPN structure can be divided into two parts: a bottom-up path on the left side of the figure, and a top-down path on the right side together with the lateral connections of the middle layers.
The bottom-up path is the process of forward convolution operation of the underlying network. In the process, the feature map passes through a series of convolution kernels, after the convolution kernel with the step size of 2, the size of the feature map is reduced to half of the original size, and after the convolution kernel with the step size of 1, the size of the feature map is kept unchanged, so that adjacent convolution layers with the same size are regarded as the same stage. Because the semantic information extracted by the network is richer as the convolution is deepened, the last layer of feature map at each stage is used as a representative of the scale features, and the representative feature maps with different scales form a pyramid level, as shown by C1, C2 and C3 in FIG. 1.
The top-down process, as shown at C4, C5 and C6 in fig. 1, is the opposite of the bottom-up process: deep features are progressively fused with shallow features. Specifically, the higher-level feature with stronger semantics is first enlarged by upsampling so that its size equals that of the next-lower feature map; the feature layer to be fused is then passed through a 1×1 convolution to adjust its number of channels to be consistent with the higher-level feature map, and the two parts are connected laterally and added element-wise to obtain the fused feature map. After the feature fusion of two adjacent layers is finished, the operation is repeated for the remaining feature layers to be fused, giving a new feature pyramid.
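The two-step merge described above (upsample the higher level, reduce the lower level to the same channel count with a 1×1 convolution, then add element-wise) can be sketched as follows; the output channel count of 256 and the use of bilinear upsampling are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    # One top-down fusion step of a standard FPN (illustrative sketch).
    # in_channels is the channel count of the lower-level map C_i; the
    # higher-level map p_high is assumed to already have out_channels channels.
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, c_low, p_high):
        # upsample the semantically stronger high-level feature to the lower level's size
        p_up = F.interpolate(p_high, size=c_low.shape[-2:], mode="bilinear",
                             align_corners=False)
        # 1x1 lateral convolution adjusts the channels, then element-wise addition
        return self.lateral(c_low) + p_up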
Step 2.2: in the feature pyramid, the highest-level features are propagated top-down through upsampling and are gradually fused with the lower-level features. In this process the lower-level features are enhanced by semantic information from the higher levels, and the fused features naturally contain different context information; however, because of the resizing, the channel dimension of the top-level features is compressed, which causes part of the information to be lost. When designing the feature fusion module, the invention therefore combines the idea of residual feature augmentation from AugFPN with the traditional Res residual structure: the introduced residual feature enhancement module (RFA) adds different context information to the original feature layer through residual branches, reducing the loss of the highest-level features and improving the performance of the pyramid.
Step 2.3: considering the aliasing effect caused by interpolation, the context features of the different parts cannot simply be added, so an Adaptive Spatial Fusion (ASF) module is added to the RFA residual feature enhancement module; its structure is shown on the right of fig. 2, and it allows the context features to be combined better. Specifically, the module takes the upsampled context features as input, generates a spatial weight for each feature through splicing, convolution and activation operations, and uses these weights to aggregate the context features into a new feature layer R that carries multi-scale context information. After the ASF generates the aggregated feature layer R, R is added to the highest-level feature to strengthen its scale information, and the strengthened highest-level feature is then used for the subsequent scale transformation and feature fusion.
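A rough sketch of such an adaptive spatial fusion step is given below. The number of context branches, the weight-generating layers and the softmax normalisation of the per-branch spatial weights are assumptions for illustration rather than the exact configuration of fig. 2.

import torch
import torch.nn as nn

class AdaptiveSpatialFusion(nn.Module):
    # Illustrative ASF sketch: given N upsampled context features of identical
    # shape, predict one spatial weight map per branch (splice -> convolution)
    # and aggregate the branches into a single enhanced feature layer R.
    def __init__(self, channels, num_branches=3):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(channels * num_branches, num_branches, kernel_size=1),
            nn.Conv2d(num_branches, num_branches, kernel_size=3, padding=1),
        )

    def forward(self, contexts):                    # list of N tensors (B, C, H, W)
        x = torch.cat(contexts, dim=1)              # splice the context branches
        w = torch.softmax(self.weight_net(x), dim=1)  # one spatial weight map per branch
        # weighted aggregation of the context features into R
        return sum(w[:, i:i + 1] * f for i, f in enumerate(contexts))

# In the RFA, the aggregated layer R is then added to the highest-level feature,
# e.g. C6_enhanced = C6 + asf([ctx1, ctx2, ctx3]), to strengthen its scale information.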
Step 2.4: clearly, the detection branches of the original SSD algorithm do not yet complement one another. Based on the above ideas, the invention designs a new bidirectional feature fusion module, the Improved Bidirectional Feature Pyramid Network (IBFPN), which improves on the traditional bidirectional feature pyramid network: RFA is introduced to strengthen the top-level features, a bottom-up fusion path is introduced, and a connection from the initial input to the output is added for the same level, as shown in fig. 3 for each feature map used for detection. The RFA module introduced at the top of fig. 3 takes the C6-layer features as input to improve their feature representation, producing R6 with multi-scale context information. The right part represents the bottom-up path that further fuses the low-level features into the high levels, so that each level has not only strong high-level semantic information but also strong low-level detail and localization information. In addition, for the same level, the structure designed by the invention adds a new connection from the initial input to the output, shown as the red connecting line in fig. 3, so that more features can be fused without increasing the number of parameters. Compared with the traditional SSD algorithm, which detects the feature maps directly, the method inputs all feature maps into the bidirectional feature fusion module IBFPN before detection so that upper-layer and lower-layer information are fused with each other.
Step 2.6: specifically, the pyramid feature levels generated by the conventional SSD object detection model in the forward propagation process are {C1, C2, C3, C4, C5, C6}, corresponding respectively to DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 in step 1.
Step 2.7: the C6-layer features are input into the RFA to improve the feature representation, and R6 with multi-scale context information is obtained through the ASF module, namely RFA(C6).
Step 2.8: R6 and the C6 subjected to 1×1 convolution dimensionality reduction are added and fused to obtain the top-layer feature C6_td; 1×1 convolution dimensionality reduction is performed on C1 to C5 to obtain features C5_in to C1_in.
Step 2.9: C1_in to C6_in are fused from top to bottom using the following formulas:
C6_td = Conv(C6) + RFA(C6)
C5_td = Conv(C5) + Resize(C6_td)
C4_td = Conv(C4) + Resize(C5_td)
C3_td = Conv(C3) + Resize(C4_td)
C2_td = Conv(C2) + Resize(C3_td)
C1_td = Conv(C1) + Resize(C2_td)
wherein Conv is a 1×1 convolution operation for reducing the dimension of the feature channels, Conv(C6) to Conv(C1) are C6_in to C1_in, RFA(·) is the residual feature enhancement performed on a feature map, Resize is the upsampling operation adopted to match the resolutions of different layers (here bilinear interpolation), and "+" represents element-wise addition at corresponding positions. C1_td to C6_td are the features obtained after addition and fusion.
Step 2.10: a bottom-up fusion path is newly aggregated to further fuse the low-level features into the high levels, so that each level has not only strong high-level semantic information but also strong low-level detail and localization information. For the same level, the invention also adds a new connection from the initial input to the output, shown as the red connecting line in fig. 3, so that more features can be fused without increasing the number of parameters, giving the fused feature layers C1_out to C6_out according to the following formulas:
C1_out = C1_td = C1_in + Resize(C2_td)
C2_out = C2_td + Resize(C1_out) + C2_in
C3_out = C3_td + Resize(C2_out) + C3_in
C4_out = C4_td + Resize(C3_out) + C4_in
C5_out = C5_td + Resize(C4_out) + C5_in
C6_out = C6_td + Resize(C5_out) + C6_in
C1_out to C6_out are the multi-scale fused feature layers output after C1 to C6 pass through the IBFPN.
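A compact sketch of the IBFPN fusion defined by the two groups of formulas above is given below. The 1×1 dimension-reduction convolutions, the bilinear resizing and the fixed fused channel width of 256 are assumptions made for illustration; the RFA module is passed in as a separate component and is assumed to output feature maps with the fused channel width.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IBFPN(nn.Module):
    # Sketch of the improved bidirectional FPN: a top-down path whose top level is
    # strengthened by RFA, a bottom-up path, and a skip connection from the initial
    # input Ci_in to the output of the same level.
    def __init__(self, in_channels_list, out_channels=256, rfa=None):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels_list])
        self.rfa = rfa  # residual feature enhancement module for the top level

    def _resize(self, x, ref):
        return F.interpolate(x, size=ref.shape[-2:], mode="bilinear",
                             align_corners=False)

    def forward(self, feats):                       # feats = [C1, ..., C6]
        c_in = [lat(f) for lat, f in zip(self.lateral, feats)]       # Ci_in
        # top-down: C6_td = Conv(C6) + RFA(C6); Ci_td = Ci_in + Resize(C(i+1)_td)
        td = [None] * len(feats)
        td[-1] = c_in[-1] + (self.rfa(feats[-1]) if self.rfa is not None else 0)
        for i in range(len(feats) - 2, -1, -1):
            td[i] = c_in[i] + self._resize(td[i + 1], c_in[i])
        # bottom-up with same-level skip: Ci_out = Ci_td + Resize(C(i-1)_out) + Ci_in
        out = [None] * len(feats)
        out[0] = td[0]                              # C1_out = C1_td
        for i in range(1, len(feats)):
            out[i] = td[i] + self._resize(out[i - 1], td[i]) + c_in[i]
        return out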
Step 3: target information in infrared images is weak and the feature structure is sparse, so complex background environments easily cause missed detections and false detections. To avoid the influence of irrelevant information in the original SSD detection algorithm, the invention improves the CBAM to obtain a new mixed attention module, ECBAM. After the IBFPN, each fused feature layer is further input into the ECBAM, which assigns different weights to different features, so that the model pays more attention to the target part and to the target regions of interest, improving the expressive power of the network and suppressing the influence of irrelevant information.
Step 3.1: the original Convolutional Block Attention Module (CBAM) mainly comprises a channel attention module and a spatial attention module; for an input feature map, channel attention and spatial attention are applied in turn so as to give different weights to features of different importance. However, during the channel attention operation the CBAM processes the results of maximum pooling and average pooling with a parameter-shared fully connected layer, which means the model cannot consider the mappings of the two kinds of features well at the same time; on the other hand, the fully connected layer takes the correlation of all global features into account, which makes the model complexity and the amount of computation too high. To address this problem, the invention provides an improved convolutional attention module based on the idea of local cross-channel interaction.
Step 3.2: the ECBAM is obtained by improving the CBAM. Its structure is shown in fig. 4 and mainly comprises a channel attention module and a spatial attention module. The channel attention module focuses on the importance of different feature channels, i.e. on "what" to attend to; the spatial attention module focuses on the importance of features at different spatial positions, i.e. on "where" the informative parts are, and is complementary to the channel attention. Assume the input feature map is I ∈ R^(H×W×C); the final processing result is obtained by passing it through the channel attention module M_c and the spatial attention module M_s in turn.
The specific structure of the channel attention module is shown in fig. 5 and comprises an average pooling layer AvgPool and a maximum pooling layer MaxPool, a one-dimensional convolution module and a Sigmoid mapping module. In order to highlight the relationships between channels, the spatial dimensions of the input feature map need to be compressed, and average pooling and maximum pooling characterize the importance of different features from different aspects. The one-dimensional convolution module considers the interaction between each channel and its k neighbouring channels; it reduces the number of parameters to a constant magnitude while bringing a performance gain, and it can consider the mappings of the two different pooled features well. The Sigmoid mapping module maps each feature into (0,1).
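A minimal sketch of this channel attention submodule is given below, assuming a shared one-dimensional convolution over the two pooled channel descriptors and the adaptive kernel size defined in step 3.5 (γ = 2 and b = 1 follow the text; rounding up to the next odd number is an implementation assumption).

import math
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # ECBAM channel attention (sketch): average- and max-pool the spatial dimensions,
    # apply a 1D convolution of adaptive size k across the channel axis, add the two
    # results and map them into (0, 1) with a Sigmoid to weight each channel.
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 else k + 1                   # nearest odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)

    def forward(self, x):                           # x: (B, C, H, W)
        # (B, C, 1, 1) -> (B, 1, C) so the 1D convolution slides over the channels
        avg = self.conv(self.avg_pool(x).squeeze(-1).transpose(-1, -2))
        mx = self.conv(self.max_pool(x).squeeze(-1).transpose(-1, -2))
        w = torch.sigmoid(avg + mx).transpose(-1, -2).unsqueeze(-1)   # (B, C, 1, 1)
        return x * w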
The specific structure of the spatial attention module is shown in fig. 6 and comprises an average pooling layer AvgPool, a maximum pooling layer MaxPool, a stacking module and a Sigmoid mapping module. Average pooling and maximum pooling are first applied along the channel dimension, and the feature maps they generate are stitched together; a convolution is then applied to the stitched feature map to produce the final spatial attention map.
In order to give different weights to different features, each fused feature layer from step 2 is input into the ECBAM respectively. An input fused feature layer (one of the outputs C1_out to C6_out of step 2) is denoted as I ∈ R^(H×W×C), wherein H and W are the height and width of the fused feature layer and C is the number of its channels. Each fused feature layer passes through the channel attention module M_c and the spatial attention module M_s in turn to obtain the final processing result. The calculation of each part can be expressed as:
I' = M_c(I) ⊗ I
I'' = M_s(I') ⊗ I'
wherein ⊗ represents element-wise multiplication, I' is the result after channel attention processing, and I'' is the final result.
Step 3.3: the feature layer I from step 2 is first compressed by average pooling and maximum pooling respectively to obtain two feature maps I_avg ∈ R^(1×1×C) and I_max ∈ R^(1×1×C).
and 3.4, in the original CBAM, the full connection processing of parameter sharing of the CBAM can cause that the model can not well consider the mapping of two characteristics at the same time, and the complexity and the calculation amount of the model are overhigh. Unlike shared fully-connected operations in CBAM, ECBAM considers the interaction of each channel and its adjacent k channels, pair I avg 、I max The operation is performed using a one-dimensional convolution of size k, respectively, which brings performance gains while reducing the parameter quantities to a constant magnitude.
Step 3.5: the two results processed in step 3.4 are added element-wise and each channel feature is mapped into (0,1) through a Sigmoid activation function, giving the weight coefficients M_c(I) of the different channels:
M_c(I) = σ(C1D_k(I_avg) + C1D_k(I_max))
where σ is the Sigmoid activation function and C1D_k is a one-dimensional convolution of size k, the size of k being adaptively determined by the following equation:
k = | log2(C)/γ + b/γ |_odd
wherein C represents the number of channels of the fused feature layer I, |·|_odd represents the odd number nearest to the result, γ = 2, b = 1.
Step 3.6: finally, M_c(I) is multiplied with the input fused feature layer I to obtain the channel attention feature map I' = M_c(I) ⊗ I, as shown in fig. 5. I' is then used as a new input feature map and fed into the spatial attention module, where average pooling and maximum pooling are performed across all channels at each position of I' to obtain I'_avg ∈ R^(H×W×1) and I'_max ∈ R^(H×W×1).
Step 3.7: I'_avg and I'_max are stacked and a standard convolution is then performed.
Step 3.8: the spatial weight M_s(I') of I' is obtained with a Sigmoid activation function; the calculation process can be expressed as:
M_s(I') = σ(f^(7×7)([AvgPool(I'); MaxPool(I')])) = σ(f^(7×7)([I'_avg; I'_max]))
where σ is the Sigmoid activation function and f^(7×7) is a convolution kernel of size 7×7.
The obtained spatial weight M_s(I') is multiplied with the channel attention feature map I' to obtain the final feature map, which can be sent for detection.
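Continuing the sketch given after the channel attention description, the spatial attention step and the full ECBAM composition could look roughly as follows; the 7×7 kernel follows the formula above, while the overall wiring is an illustrative assumption.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # ECBAM spatial attention (sketch): pool across the channel dimension, stack the
    # two maps, apply a 7x7 convolution and a Sigmoid to obtain per-position weights.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                           # x: (B, C, H, W)
        avg = torch.mean(x, dim=1, keepdim=True)    # (B, 1, H, W)
        mx, _ = torch.max(x, dim=1, keepdim=True)   # (B, 1, H, W)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class ECBAM(nn.Module):
    # Channel attention followed by spatial attention: I'' = M_s(I') * I' with
    # I' = M_c(I) * I. ChannelAttention is the sketch class defined earlier.
    def __init__(self, channels):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))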
Step 4: the DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 feature maps processed by the IBFPN and ECBAM are sent to the detection module for detection, and the category and corresponding bounding box of each candidate box are obtained as the prediction result.
Step 5: non-maximum suppression is performed on the prediction result, and redundant target boxes are deleted to obtain the final detection result.
Step 6: each module is verified experimentally, and the whole detection algorithm is verified and analysed.
Step 7, IBFPN module verification: based on the MobileNet feature extraction network, comparison experiments are designed using the original feature pyramid levels, the traditional FPN structure, the bidirectional FPN structure and the IBFPN designed by the invention for feature fusion; the detection results are shown in Table 1.
Table 1 Feature fusion module experiment comparison results
As can be seen from Table 1, compared with not using a feature fusion module, the traditional FPN improves the detection accuracy for small targets such as "person" and "bicycle" to a certain extent, but the overall effect is not obvious, with the mAP (mean average precision) increasing by 0.63%; the bidirectional FPN not only improves the detection of small targets but also improves the detection precision of the large target "car", raising the mAP from the original 49.61% to 50.83%; the IBFPN module designed by the invention clearly enhances the detection of both large and small targets, and the overall mAP is improved by 2.03%.
Step 8, ECBAM module verification
In order to verify the effectiveness of the attention module ECBAM designed by the invention for the SSD detection algorithm, the experiment adds different attention modules to each feature branch, specifically: feeding the branch directly into the detection module without additional processing, adding the channel attention module SE after the branch, adding the mixed attention module CBAM, and adding the ECBAM module improved and designed by the invention. The detection results are shown in Table 2.
TABLE 2 attention model experiment comparison results table
Attention module    Not used    SENet     CBAM      ECBAM
mAP (%)             49.61%      50.17%    50.62%    51.04%
As can be seen from Table 2, adding an attention module improves the target detection effect to a certain extent, and the mixed attention modules perform better than the single channel attention module: the mAP of the CBAM is 0.45% higher than that of SENet. The ECBAM replaces the shared fully connected layer in the CBAM with a one-dimensional convolution, which reduces the complexity and improves the accuracy, raising the mAP by a further 0.42%; compared with not using an attention module, the mAP of the ECBAM is 1.43% higher, showing that the attention module can effectively suppress the influence of the irrelevant background, reducing the false detection rate and increasing the detection accuracy for the target.
Step 9, comparative analysis of the whole detection algorithm
In order to fully illustrate the effectiveness of the algorithm of the invention, the SSD_BIFPN_ECBAM algorithm designed by the invention is compared and analysed against various detection algorithms on the FLIR dataset, including SSD algorithms using different feature extraction networks (SSD_VGG16 and SSD_MobileNet) and the DSSD algorithm, which performs a different feature fusion; the detection results are shown in Table 3.
Table 3 Experimental comparison results of detection algorithms
Network             Backbone     mAP (%)    FPS
SSD                 VGG16        49.98%     16.7
SSD                 MobileNet    47.87%     20.5
DSSD                ResNet       52.84%     15.7
SSD_BIFPN_ECBAM     MobileNet    53.02%     17.6
As can be seen from Table 3, the target detection method based on feature fusion and an attention mechanism designed by the invention surpasses the other algorithms in precision. Compared with the original SSD_VGG16 algorithm, the detection precision and speed of the algorithm are improved by 3.04% and 0.9 FPS respectively; compared with the SSD_MobileNet algorithm, the detection precision of the algorithm is improved by 5.15%; the precision of the DSSD algorithm is close to that of the algorithm, but its detection speed is lower.
Sample results are shown in fig. 7 and fig. 8. Fig. 7 contains cars and pedestrians, and it can be seen that the algorithm successfully detects the targets in the frame; on the right side of the figure only a small part of the front of a passing car is visible, but the algorithm still detects it successfully. The difficulty of the scene in fig. 8 is that the cars are occluded by each other and by the surrounding environment, yet the algorithm still successfully detects all targets in both of these complex scenes, showing a good detection effect.

Claims (7)

1. A target detection method based on feature fusion and attention mechanism is characterized by comprising the following steps:
step 1, inputting the infrared image into a MobileNet network to perform layer-by-layer convolution calculation to obtain feature maps of different scales, wherein the 6 feature maps DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 are the feature pyramid images used for detection;
step 2, establishing a bidirectional feature fusion module IBFPN, and inputting the feature pyramid image for detection into the IBFPN to perform mutual fusion of upper and lower layer information;
step 3, after the IBFPN, inputting each fused feature layer into an attention module ECBAM, and assigning different weights to different features through the ECBAM;
step 4, sending the DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 feature maps processed by the IBFPN and ECBAM into a detection module for detection, and obtaining the category and corresponding bounding box of each candidate box to obtain a prediction result;
and 5, performing non-maximum suppression on the prediction result, and deleting redundant target boxes to obtain a final detection result.
2. The target detection method based on feature fusion and attention mechanism as claimed in claim 1, wherein in step 2, the IBFPN is based on the bidirectional pyramid network, constructs a residual feature enhancement module RFA to enhance the top-level features, introduces a bottom-up fusion path, and adds a connection from the initial input to the output for the same level.
3. The method for detecting an object based on feature fusion and attention mechanism according to claim 2, wherein in the step 2, the step of fusing the upper and lower layer information with each other is as follows:
step 2.1, in the forward propagation process, the pyramid feature levels generated by the traditional SSD target detection model are {C1, C2, C3, C4, C5, C6}, corresponding respectively to DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 in step 1;
step 2.2, inputting the C6-layer features into the RFA to improve the feature representation, and obtaining an output feature layer R6 with multi-scale context information, namely RFA(C6);
step 2.3, adding and fusing R6 and the C6 subjected to 1×1 convolution dimensionality reduction to obtain the top-layer feature C6_td;
step 2.4, performing 1×1 convolution dimensionality reduction on C1 to C5 to obtain features C5_in to C1_in;
step 2.5, fusing C1_in to C6_in from top to bottom by using the following formulas:
C6_td = Conv(C6) + RFA(C6)
C5_td = Conv(C5) + Resize(C6_td)
C4_td = Conv(C4) + Resize(C5_td)
C3_td = Conv(C3) + Resize(C4_td)
C2_td = Conv(C2) + Resize(C3_td)
C1_td = Conv(C1) + Resize(C2_td)
wherein Conv is a 1×1 convolution operation for reducing the dimension of the feature channels, Conv(C6) to Conv(C1) are C6_in to C1_in, RFA(·) is the residual feature enhancement performed on a feature map, Resize is the upsampling operation adopted to match the resolutions of different layers, and "+" represents element-wise addition at corresponding positions; C1_td to C6_td are the features obtained after addition and fusion;
step 2.6, fusing the low-level features into the high levels from bottom to top, so that each level has not only strong high-level semantic information but also strong low-level detail and localization information, and adding a connection from the initial input to the output of the same level so that more features are fused without increasing the number of parameters, to obtain the fused feature layers C1_out to C6_out, wherein the fusion formulas are:
C1_out = C1_td = C1_in + Resize(C2_td)
C2_out = C2_td + Resize(C1_out) + C2_in
C3_out = C3_td + Resize(C2_out) + C3_in
C4_out = C4_td + Resize(C3_out) + C4_in
C5_out = C5_td + Resize(C4_out) + C5_in
C6_out = C6_td + Resize(C5_out) + C6_in
C1_out to C6_out are the multi-scale fused feature layers output after C1 to C6 pass through the IBFPN.
4. The target detection method based on feature fusion and attention mechanism as claimed in claim 3, wherein the RFA introduces the idea of residual feature augmentation from AugFPN into the Res residual structure and incorporates the adaptive spatial fusion module, and the RFA adds context information through residual features to reduce the loss of the highest-level features and improve the performance of the pyramid.
5. The feature fusion and attention mechanism based target detection method of claim 3, wherein the upsampling operation is a bilinear interpolation.
6. The target detection method based on feature fusion and attention mechanism according to any one of claims 1 to 5, wherein the attention module ECBAM comprises a channel attention module and a spatial attention module, wherein the channel attention module focuses on the importance of different feature channels and comprises an average pooling layer AvgPool and a maximum pooling layer MaxPool, a one-dimensional convolution module and a Sigmoid mapping module, and the spatial attention module focuses on the importance of features at different spatial positions, i.e. on "where" the informative parts are, is complementary to the channel attention module, and comprises an average pooling layer AvgPool and a maximum pooling layer MaxPool, a stacking module and a Sigmoid mapping module.
7. The target detection method based on feature fusion and attention mechanism as claimed in claim 6, wherein in step 3, different weights are assigned to different features as follows:
step 3.1, inputting each fused feature layer from step 2 into the ECBAM respectively, an input fused feature layer being denoted as I ∈ R^(H×W×C), wherein H and W are the height and width of the fused feature layer and C is the number of its channels; I is first compressed by average pooling and maximum pooling respectively to obtain two feature maps I_avg ∈ R^(1×1×C) and I_max ∈ R^(1×1×C);
step 3.2, the ECBAM considers the interaction between each channel and its k neighbouring channels, and applies a one-dimensional convolution of size k to I_avg and I_max respectively;
step 3.3, adding the two results of step 3.2 element-wise and mapping each channel feature into (0,1) through a Sigmoid activation function to obtain the weight coefficients M_c(I) of the different channels, calculated as:
M_c(I) = σ(C1D_k(I_avg) + C1D_k(I_max))
wherein σ is the Sigmoid activation function and C1D_k is a one-dimensional convolution of size k, the size of k being adaptively determined by:
k = | log2(C)/γ + b/γ |_odd
wherein |·|_odd represents the odd number nearest to the result, γ = 2, b = 1;
step 3.4, multiplying M_c(I) with the fused feature layer I to obtain the channel attention feature map I', i.e. I' = M_c(I) ⊗ I, wherein ⊗ denotes element-wise multiplication;
step 3.5, inputting I' as a new input feature map into the spatial attention module, and performing average pooling and maximum pooling across all channels at each position of I' to obtain I'_avg ∈ R^(H×W×1) and I'_max ∈ R^(H×W×1);
step 3.6, stacking I'_avg and I'_max and then performing a standard convolution;
step 3.7, obtaining the spatial weight M_s(I') of I' by using a Sigmoid activation function, the calculation process being:
M_s(I') = σ(f^(7×7)([AvgPool(I'); MaxPool(I')])) = σ(f^(7×7)([I'_avg; I'_max]))
wherein f^(7×7) is a convolution kernel of size 7×7;
step 3.8, multiplying the obtained spatial weight M_s(I') with the channel attention feature map I' to obtain the final feature map that is sent for detection, the feature maps at this point being the DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 feature maps after IBFPN and ECBAM processing.
CN202210998016.0A 2022-08-19 2022-08-19 Target detection method based on feature fusion and attention mechanism Pending CN115424104A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210998016.0A CN115424104A (en) 2022-08-19 2022-08-19 Target detection method based on feature fusion and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210998016.0A CN115424104A (en) 2022-08-19 2022-08-19 Target detection method based on feature fusion and attention mechanism

Publications (1)

Publication Number Publication Date
CN115424104A true CN115424104A (en) 2022-12-02

Family

ID=84197543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210998016.0A Pending CN115424104A (en) 2022-08-19 2022-08-19 Target detection method based on feature fusion and attention mechanism

Country Status (1)

Country Link
CN (1) CN115424104A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631422A (en) * 2022-12-23 2023-01-20 国家海洋局东海信息中心 Enteromorpha recognition method based on attention mechanism
CN115631422B (en) * 2022-12-23 2023-04-28 国家海洋局东海信息中心 Enteromorpha identification method based on attention mechanism
CN116310709A (en) * 2023-02-03 2023-06-23 江苏科技大学 Lightweight infrared target detection method based on improved PF-YOLO
CN115861772A (en) * 2023-02-22 2023-03-28 杭州电子科技大学 Multi-scale single-stage target detection method based on RetinaNet
CN116664462A (en) * 2023-05-19 2023-08-29 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN116664462B (en) * 2023-05-19 2024-01-19 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN116895050A (en) * 2023-09-11 2023-10-17 四川高速公路建设开发集团有限公司 Tunnel fire disaster identification method and device
CN116895050B (en) * 2023-09-11 2023-12-08 四川高速公路建设开发集团有限公司 Tunnel fire disaster identification method and device

Similar Documents

Publication Publication Date Title
CN115424104A (en) Target detection method based on feature fusion and attention mechanism
CN111325751B (en) CT image segmentation system based on attention convolution neural network
CN112149504B (en) Motion video identification method combining mixed convolution residual network and attention
CN108985317B (en) Image classification method based on separable convolution and attention mechanism
CN114120019A (en) Lightweight target detection method
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
CN110175986B (en) Stereo image visual saliency detection method based on convolutional neural network
CN112541409B (en) Attention-integrated residual network expression recognition method
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN111325165B (en) Urban remote sensing image scene classification method considering spatial relationship information
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN111612008A (en) Image segmentation method based on convolution network
CN115690522B (en) Target detection method based on multi-pooling fusion channel attention and application thereof
CN111583285A (en) Liver image semantic segmentation method based on edge attention strategy
CN113239825B (en) High-precision tobacco beetle detection method in complex scene
CN112634146A (en) Multi-channel CNN medical CT image denoising method based on multiple attention mechanisms
CN112966747A (en) Improved vehicle detection method based on anchor-frame-free detection network
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN115797635A (en) Multi-stage instance segmentation method and system based on parallel feature completion
CN114821341A (en) Remote sensing small target detection method based on double attention of FPN and PAN network
Kan et al. A GAN-based input-size flexibility model for single image dehazing
CN114998756A (en) Yolov 5-based remote sensing image detection method and device and storage medium
CN112967296B (en) Point cloud dynamic region graph convolution method, classification method and segmentation method
CN114049491A (en) Fingerprint segmentation model training method, fingerprint segmentation device, fingerprint segmentation equipment and fingerprint segmentation medium
CN112132207A (en) Target detection neural network construction method based on multi-branch feature mapping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination