CN115424104A - Target detection method based on feature fusion and attention mechanism - Google Patents

Target detection method based on feature fusion and attention mechanism

Info

Publication number
CN115424104A
Authority
CN
China
Prior art keywords: feature, layer, module, detection, fusion
Legal status
Pending
Application number
CN202210998016.0A
Other languages
Chinese (zh)
Inventor
周慧鑫
张伟鹏
戴加乐
于跃
宋江鲁奇
张嘉嘉
李苗青
张喆
滕翔
王珂
李欢
朱贺隆
梅峻溪
王财顺
石志鹏
Current Assignee
Xidian University
Original Assignee
Xidian University
Application filed by Xidian University
Priority to CN202210998016.0A
Publication of CN115424104A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection


Abstract

An infrared image is input into a MobileNet network for layer-by-layer convolution to obtain feature maps of different scales. A bidirectional feature fusion module, IBFPN, is established, and the feature pyramid maps used for detection are input into the IBFPN so that upper-layer and lower-layer information are fused with each other. After the IBFPN, each fused feature layer is input into an attention module, ECBAM, which assigns different weights to different features. The feature maps of different scales processed by the IBFPN and ECBAM are sent to a detection module for detection, and the category and corresponding bounding box of each candidate box are obtained as the prediction result. Non-maximum suppression is then applied to the prediction result, and redundant target boxes are deleted to obtain the final detection result. A comprehensive comparison with various detection algorithms shows experimentally that the method can effectively improve the detection precision of infrared targets.

Description

Target detection method based on feature fusion and attention mechanism
Technical Field
The invention belongs to the field of infrared target detection, and particularly relates to a target detection method based on feature fusion and an attention mechanism.
Background
Target detection has long been a research focus as an important branch of image processing and computer vision. Target detection, also called target extraction, essentially classifies and localizes multiple objects in a given image. Unlike visible-light imaging, infrared imaging is a passive imaging technology that forms images by detecting the infrared thermal radiation emitted by objects, and it offers long detection range, strong penetration and round-the-clock operation. An infrared target detection technology that combines target detection with an infrared imaging system can therefore effectively make up for the shortcomings of visible-light detection and meet the requirements of different industries.
In recent years, deep learning algorithms have performed excellently in the field of image processing. Unlike traditional hand-crafted features, deep learning extracts features automatically through a convolutional neural network and has good robustness and portability, so infrared target detection algorithms based on deep learning have important research value. The SSD detection algorithm uses a backbone network plus additional convolution layers to generate six groups of feature maps of different scales from which candidate-box categories and coordinates are predicted. Feature maps produced at different convolution depths contain different image information: shallow feature maps contain texture and position information of the image, but the extracted features are not comprehensive; deep feature maps contain richer semantic information but lose much detail during convolution. The SSD feeds each branch into the detection module independently, so the features of the different layers are neither connected nor complementary. In addition, target information in infrared images is weak and the feature structure is sparse, so missed detections and false detections easily occur in complex background environments.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a target detection method based on feature fusion and an attention mechanism, which improves the traditional SSD detection algorithm, strengthens the expressive power of the detection branches, and suppresses the influence of irrelevant information, thereby alleviating the missed detections and false detections caused by complex backgrounds.
In order to achieve the above aim, the invention adopts the following technical solution:
a target detection method based on feature fusion and attention mechanism comprises the following steps:
step 1, inputting the infrared image into a MobileNet network to perform layer-by-layer convolution calculation to obtain feature maps of different scales, wherein the 6 feature maps DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 are the feature pyramid images used for detection;
step 2, establishing a bidirectional feature fusion module IBFPN, and inputting the feature pyramid image for detection into the IBFPN to perform mutual fusion of upper and lower layer information;
step 3, after the IBFPN, inputting each fused feature layer into an attention module ECBAM, and assigning different weights to different features through the ECBAM;
step 4, sending the DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 feature maps processed by the IBFPN and ECBAM into a detection module for detection, and obtaining the category and corresponding bounding box of each candidate box to obtain a prediction result;
and 5, performing non-maximum suppression on the prediction result, and deleting redundant target boxes to obtain a final detection result.
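For illustration only, the five steps above can be composed as in the following PyTorch-style sketch. The module objects backbone, ibfpn, ecbams and head, as well as the score and IoU thresholds, are placeholders standing in for the components described in steps 1 to 4; this is a rough outline under those assumptions, not the reference implementation.

import torch
import torchvision

def detect(image, backbone, ibfpn, ecbams, head, score_thr=0.5, iou_thr=0.45):
    # Step 1: layer-by-layer convolution gives six multi-scale feature maps
    feats = backbone(image)                               # [C1, ..., C6]
    # Step 2: bidirectional fusion of upper- and lower-layer information
    feats = ibfpn(feats)                                  # [C1_out, ..., C6_out]
    # Step 3: re-weight each fused layer with its ECBAM attention module
    feats = [att(f) for att, f in zip(ecbams, feats)]
    # Step 4: predict a category score and a bounding box for every candidate box
    boxes, scores, labels = head(feats)                   # (N, 4), (N,), (N,)
    # Step 5: drop low-confidence boxes, then class-wise non-maximum suppression
    keep = scores > score_thr
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = torchvision.ops.batched_nms(boxes, scores, labels, iou_thr)
    return boxes[keep], labels[keep], scores[keep]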
In an embodiment, in step 2, instead of detecting each feature map directly as in the conventional SSD algorithm, each feature map used for detection is input into the bidirectional feature fusion module IBFPN before detection, so that upper-layer and lower-layer information are fused with each other. The IBFPN is based on the traditional bidirectional pyramid network: a residual feature enhancement module RFA is constructed to enhance the top-level features, a bottom-up fusion path is introduced, and a connection from the initial input to the output is added for the same level.
In an embodiment, in step 2, the step of fusing the upper and lower layer information with each other is as follows:
step 2.1, in the forward propagation process, the pyramid feature levels generated by the traditional SSD target detection model are {C1, C2, C3, C4, C5, C6}, corresponding respectively to DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 in step 1;
step 2.2, inputting the C6-layer features into the RFA residual feature enhancement module to improve the feature representation, and obtaining an output feature layer R6 with multi-scale context information, namely RFA(C6);
step 2.3, adding and fusing R6 and the C6 subjected to 1×1 convolution dimensionality reduction to obtain the top-layer feature C6_td;
step 2.4, performing 1×1 convolution dimensionality reduction on C1 to C5 to obtain features C5_in to C1_in;
step 2.5, fusing C1_in to C6_in from top to bottom by using the following formulas:
C6_td = Conv(C6) + RFA(C6)
C5_td = Conv(C5) + Resize(C6_td)
C4_td = Conv(C4) + Resize(C5_td)
C3_td = Conv(C3) + Resize(C4_td)
C2_td = Conv(C2) + Resize(C3_td)
C1_td = Conv(C1) + Resize(C2_td)
wherein Conv is a 1×1 convolution operation for reducing the dimension of the feature channels, Conv(C6) to Conv(C1) are C6_in to C1_in, RFA(·) is the residual feature enhancement performed on a feature map, Resize is the upsampling operation adopted to match the resolutions of different layers, and "+" represents element-wise addition at corresponding positions; C1_td to C6_td are the features obtained after addition and fusion;
step 2.6, fusing the low-level features into the high levels from bottom to top, so that each level has not only strong high-level semantic information but also strong low-level detail and localization information, and adding a connection from the initial input to the output of the same level so that more features are fused without increasing the number of parameters, to obtain the fused feature layers C1_out to C6_out, wherein the fusion formulas are:
C1_out = C1_td = C1_in + Resize(C2_td)
C2_out = C2_td + Resize(C1_out) + C2_in
C3_out = C3_td + Resize(C2_out) + C3_in
C4_out = C4_td + Resize(C3_out) + C4_in
C5_out = C5_td + Resize(C4_out) + C5_in
C6_out = C6_td + Resize(C5_out) + C6_in
C1_out to C6_out are the multi-scale fused feature layers output after C1 to C6 pass through the IBFPN.
In one embodiment, the residual feature enhancement module RFA combines the idea of residual feature augmentation from AugFPN with the conventional Res residual structure and incorporates an Adaptive Spatial Fusion module (ASF). In both the FPN and the bidirectional feature pyramid, the highest-level features are upsampled and propagated from top to bottom, being gradually fused with the lower-level features. In this process the lower-level features are enhanced by semantic information from the higher levels, and the fused features naturally contain different context information; however, because of the resizing, the channel dimension of the top-level features is compressed, which causes part of the information to be lost. For this situation, the RFA designed by the invention adds context information through residual features to reduce the loss of the highest-level features and improve the performance of the pyramid.
Illustratively, the upsampling operation is a bilinear interpolation.
In one embodiment, the attention module ECBAM is obtained by improving the Convolutional Block Attention Module (CBAM). The module mainly comprises a channel attention module and a spatial attention module. The channel attention module focuses on the importance of different feature channels and comprises an average pooling layer AvgPool and a maximum pooling layer MaxPool, a one-dimensional convolution module and a Sigmoid mapping module. The spatial attention module focuses on the importance of features at different spatial positions, i.e. on "where" the informative parts are, and is complementary to the channel attention; it comprises an average pooling layer AvgPool and a maximum pooling layer MaxPool, a stacking module and a Sigmoid mapping module.
In one embodiment, in step 3, different weights are assigned to different features as follows:
step 3.1, inputting each fused feature layer from step 2 (the outputs C1_out to C6_out) into the ECBAM respectively; an input fused feature layer is denoted as I ∈ R^(H×W×C), wherein H and W are the height and width of the fused feature layer and C is the number of its channels; I is first compressed by average pooling and maximum pooling respectively to obtain two feature maps I_avg ∈ R^(1×1×C) and I_max ∈ R^(1×1×C);
step 3.2, the ECBAM considers the interaction between each channel and its k neighbouring channels, and applies a one-dimensional convolution of size k to I_avg and I_max respectively;
step 3.3, adding the two results of step 3.2 element-wise and mapping each channel feature into (0,1) through a Sigmoid activation function to obtain the weight coefficients M_c(I) of the different channels, calculated as:
M_c(I) = σ(C1D_k(I_avg) + C1D_k(I_max))
wherein σ is the Sigmoid activation function and C1D_k is a one-dimensional convolution of size k, the size of k being adaptively determined by:
k = | log2(C)/γ + b/γ |_odd
wherein C represents the number of channels of the fused feature layer I, |·|_odd represents the odd number nearest to the result, γ = 2, b = 1;
step 3.4, multiplying M_c(I) with the fused feature layer I to obtain the channel attention feature map I', i.e. I' = M_c(I) ⊗ I, wherein ⊗ denotes element-wise multiplication;
step 3.5, inputting I' as a new input feature map into the spatial attention module, and performing average pooling and maximum pooling across all channels at each position of I' to obtain I'_avg ∈ R^(H×W×1) and I'_max ∈ R^(H×W×1);
step 3.6, stacking I'_avg and I'_max and then performing a standard convolution;
step 3.7, obtaining the spatial weight M_s(I') of I' by using a Sigmoid activation function, calculated as:
M_s(I') = σ(f^(7×7)([AvgPool(I'); MaxPool(I')])) = σ(f^(7×7)([I'_avg; I'_max]))
wherein f^(7×7) is a convolution kernel of size 7×7.
Step 3.8, multiplying the obtained spatial weight M_s(I') with the channel attention feature map I' to obtain the final feature map that is sent for detection; the feature maps at this point are the DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 feature maps after IBFPN and ECBAM processing.
Compared with the prior art, the invention has the following beneficial effects:
the SSD detection algorithm based on the feature fusion and attention mechanism designed by the invention surpasses other algorithms in precision. Compared with the original SSD _ VGG16 algorithm, the detection precision and the algorithm efficiency of the algorithm are respectively improved by 3.04% and 0.9FPS; compared with the SSD _ MobileNet algorithm, the detection precision of the algorithm is improved by 5.15 percent; the precision of the DSSD algorithm is close to that of the algorithm, but the detection speed of the DSSD algorithm is lower than that of the algorithm.
Drawings
Fig. 1 is a schematic diagram of a feature pyramid network FPN and a local structure.
Fig. 2 is a schematic diagram of a residual feature enhancement module.
Fig. 3 is a schematic structural diagram of an IBFPN module with bidirectional feature pyramid.
Fig. 4 is a schematic diagram of a convolution attention module ECBAM structure.
Fig. 5 is a schematic structural diagram of a channel attention module.
Fig. 6 is a schematic structural diagram of a spatial attention module.
Fig. 7 shows the detection results of the present invention for vehicles and pedestrians.
Fig. 8 shows the detection results of the present invention under conditions where vehicles are occluded by each other and by the environment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The method mainly studies targets in infrared scenes, and the infrared dataset FLIR is used to analyse and verify the algorithm in order to demonstrate its effectiveness. All experiments use the Ubuntu 16.04 operating system with an NVIDIA GTX 1050Ti graphics card. The algorithm is written in Python 3.6.6, CUDA 9.0.136 is used as the parallel computing architecture, and the model is constructed, trained and tested with the PyTorch framework. The overall training process of the detection model is divided into two stages: a freezing stage and a thawing stage. In the freezing stage the backbone of the model is frozen, so the feature extraction network does not change; the batch size (batch_size) is 16, the learning rate is 0.0005, and 50 epochs are trained. In the thawing stage the whole model is trained and all parameters can change; the batch size is 8, the learning rate is 0.0001, and 100 epochs are trained in total.
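A minimal sketch of this two-stage schedule is given below, assuming the backbone is exposed as model.backbone and that a standard Adam optimizer is used; the optimizer choice, the criterion and the data-loading names are assumptions for illustration, while the batch sizes, learning rates and epoch counts follow the text.

import torch
from torch.utils.data import DataLoader

def train_two_stage(model, train_set, criterion, device="cuda"):
    model.to(device)

    # Freezing stage: backbone parameters are not updated (50 epochs, batch 16, lr 5e-4)
    for p in model.backbone.parameters():
        p.requires_grad = False
    loader = DataLoader(train_set, batch_size=16, shuffle=True)
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=5e-4)
    for epoch in range(50):
        for images, targets in loader:
            loss = criterion(model(images.to(device)), targets)
            opt.zero_grad(); loss.backward(); opt.step()

    # Thawing stage: the whole model is trained (100 epochs, batch 8, lr 1e-4)
    for p in model.parameters():
        p.requires_grad = True
    loader = DataLoader(train_set, batch_size=8, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(100):
        for images, targets in loader:
            loss = criterion(model(images.to(device)), targets)
            opt.zero_grad(); loss.backward(); opt.step()

The basic flow of the detection method is as follows: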
Step 1, inputting an infrared image into the traditional lightweight network MobileNet for layer-by-layer convolution calculation to obtain feature maps of different scales, wherein the 6 feature maps DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 are the feature pyramid images used for detection;
and 2, the SSD inputs the image into a feature extraction network in a pyramid feature level detection mode and carries out forward propagation to obtain feature maps with different scales, and then, the feature maps are respectively predicted. The method naturally generates the features with different scales through convolution without extra calculation, but each prediction only utilizes the features of one layer, and no information exchange and supplement between the features of different layers are realized. Specifically, the depth of the convolution layer in the SSD detection algorithm is different, and the image information included in the feature map is also different. The shallow feature map contains information such as texture and position of the image, but the extracted features are not comprehensive; the deep feature map contains richer semantic information, but loses much detail in the convolution process. The traditional SSD detection algorithm only inputs each branch information into the detection module independently, and the characteristics of each layer are not related and supplemented. The shallow feature can be strengthened layer by adding an FPN structure in the SSD algorithm, but the position information contained in the shallow feature is not effectively transmitted to the deep feature.
Therefore, on the basis of the original pyramid feature levels, the invention establishes the bidirectional feature fusion module IBFPN, which can effectively transmit shallow information. The feature pyramid images used for detection are input into the IBFPN so that upper-layer and lower-layer information are fused with each other, thereby improving the detection performance for the target.
Step 2.1: in the FPN, the top-level features are continuously upsampled and fused with the features of the other layers to obtain strengthened feature maps; these feature maps form a new feature pyramid, which is input into the detection module to predict the category and position of the target. As shown in fig. 1, the FPN structure can be divided into two parts: a bottom-up path on the left side of the figure, and a top-down path on the right side together with the lateral connections of the middle layers.
The bottom-up path is the process of forward convolution operation of the underlying network. In the process, the feature map passes through a series of convolution kernels, after the convolution kernel with the step size of 2, the size of the feature map is reduced to half of the original size, and after the convolution kernel with the step size of 1, the size of the feature map is kept unchanged, so that adjacent convolution layers with the same size are regarded as the same stage. Because the semantic information extracted by the network is richer as the convolution is deepened, the last layer of feature map at each stage is used as a representative of the scale features, and the representative feature maps with different scales form a pyramid level, as shown by C1, C2 and C3 in FIG. 1.
The top-down process, as shown at C4, C5 and C6 in fig. 1, is the opposite of the bottom-up process: deep features are progressively fused with shallow features. Specifically, the higher-level feature with stronger semantics is first enlarged by upsampling so that its size equals that of the next-lower feature map; the feature layer to be fused is then passed through a 1×1 convolution to adjust its number of channels to be consistent with the higher-level feature map, and the two parts are connected laterally and added element-wise to obtain the fused feature map. After the feature fusion of two adjacent layers is finished, the operation is repeated for the remaining feature layers to be fused, giving a new feature pyramid.
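The two-step merge described above (upsample the higher level, reduce the lower level to the same channel count with a 1×1 convolution, then add element-wise) can be sketched as follows; the output channel count of 256 and the use of bilinear upsampling are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    # One top-down fusion step of a standard FPN (illustrative sketch).
    # in_channels is the channel count of the lower-level map C_i; the
    # higher-level map p_high is assumed to already have out_channels channels.
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, c_low, p_high):
        # upsample the semantically stronger high-level feature to the lower level's size
        p_up = F.interpolate(p_high, size=c_low.shape[-2:], mode="bilinear",
                             align_corners=False)
        # 1x1 lateral convolution adjusts the channels, then element-wise addition
        return self.lateral(c_low) + p_up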
Step 2.2: in the feature pyramid, the highest-level features are propagated top-down through upsampling and are gradually fused with the lower-level features. In this process the lower-level features are enhanced by semantic information from the higher levels, and the fused features naturally contain different context information; however, because of the resizing, the channel dimension of the top-level features is compressed, which causes part of the information to be lost. When designing the feature fusion module, the invention therefore combines the idea of residual feature augmentation from AugFPN with the traditional Res residual structure: the introduced residual feature enhancement module (RFA) adds different context information to the original feature layer through residual branches, reducing the loss of the highest-level features and improving the performance of the pyramid.
Step 2.3: considering the aliasing effect caused by interpolation, the context features of the different parts cannot simply be added, so an Adaptive Spatial Fusion (ASF) module is added to the RFA residual feature enhancement module; its structure is shown on the right of fig. 2, and it allows the context features to be combined better. Specifically, the module takes the upsampled context features as input, generates a spatial weight for each feature through splicing, convolution and activation operations, and uses these weights to aggregate the context features into a new feature layer R that carries multi-scale context information. After the ASF generates the aggregated feature layer R, R is added to the highest-level feature to strengthen its scale information, and the strengthened highest-level feature is then used for the subsequent scale transformation and feature fusion.
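A rough sketch of such an adaptive spatial fusion step is given below. The number of context branches, the weight-generating layers and the softmax normalisation of the per-branch spatial weights are assumptions for illustration rather than the exact configuration of fig. 2.

import torch
import torch.nn as nn

class AdaptiveSpatialFusion(nn.Module):
    # Illustrative ASF sketch: given N upsampled context features of identical
    # shape, predict one spatial weight map per branch (splice -> convolution)
    # and aggregate the branches into a single enhanced feature layer R.
    def __init__(self, channels, num_branches=3):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(channels * num_branches, num_branches, kernel_size=1),
            nn.Conv2d(num_branches, num_branches, kernel_size=3, padding=1),
        )

    def forward(self, contexts):                    # list of N tensors (B, C, H, W)
        x = torch.cat(contexts, dim=1)              # splice the context branches
        w = torch.softmax(self.weight_net(x), dim=1)  # one spatial weight map per branch
        # weighted aggregation of the context features into R
        return sum(w[:, i:i + 1] * f for i, f in enumerate(contexts))

# In the RFA, the aggregated layer R is then added to the highest-level feature,
# e.g. C6_enhanced = C6 + asf([ctx1, ctx2, ctx3]), to strengthen its scale information.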
Step 2.4: clearly, the detection branches of the original SSD algorithm do not yet complement one another. Based on the above ideas, the invention designs a new bidirectional feature fusion module, the Improved Bidirectional Feature Pyramid Network (IBFPN), which improves on the traditional bidirectional feature pyramid network: RFA is introduced to strengthen the top-level features, a bottom-up fusion path is introduced, and a connection from the initial input to the output is added for the same level, as shown in fig. 3 for each feature map used for detection. The RFA module introduced at the top of fig. 3 takes the C6-layer features as input to improve their feature representation, producing R6 with multi-scale context information. The right part represents the bottom-up path that further fuses the low-level features into the high levels, so that each level has not only strong high-level semantic information but also strong low-level detail and localization information. In addition, for the same level, the structure designed by the invention adds a new connection from the initial input to the output, shown as the red connecting line in fig. 3, so that more features can be fused without increasing the number of parameters. Compared with the traditional SSD algorithm, which detects the feature maps directly, the method inputs all feature maps into the bidirectional feature fusion module IBFPN before detection so that upper-layer and lower-layer information are fused with each other.
Step 2.6: specifically, the pyramid feature levels generated by the conventional SSD object detection model in the forward propagation process are {C1, C2, C3, C4, C5, C6}, corresponding respectively to DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 in step 1.
Step 2.7: the C6-layer features are input into the RFA to improve the feature representation, and R6 with multi-scale context information is obtained through the ASF module, namely RFA(C6).
Step 2.8: R6 and the C6 subjected to 1×1 convolution dimensionality reduction are added and fused to obtain the top-layer feature C6_td; 1×1 convolution dimensionality reduction is performed on C1 to C5 to obtain features C5_in to C1_in.
Step 2.9: C1_in to C6_in are fused from top to bottom using the following formulas:
C6_td = Conv(C6) + RFA(C6)
C5_td = Conv(C5) + Resize(C6_td)
C4_td = Conv(C4) + Resize(C5_td)
C3_td = Conv(C3) + Resize(C4_td)
C2_td = Conv(C2) + Resize(C3_td)
C1_td = Conv(C1) + Resize(C2_td)
wherein Conv is a 1×1 convolution operation for reducing the dimension of the feature channels, Conv(C6) to Conv(C1) are C6_in to C1_in, RFA(·) is the residual feature enhancement performed on a feature map, Resize is the upsampling operation adopted to match the resolutions of different layers (here bilinear interpolation), and "+" represents element-wise addition at corresponding positions. C1_td to C6_td are the features obtained after addition and fusion.
Step 2.10: a bottom-up fusion path is newly aggregated to further fuse the low-level features into the high levels, so that each level has not only strong high-level semantic information but also strong low-level detail and localization information. For the same level, the invention also adds a new connection from the initial input to the output, shown as the red connecting line in fig. 3, so that more features can be fused without increasing the number of parameters, giving the fused feature layers C1_out to C6_out according to the following formulas:
C1_out = C1_td = C1_in + Resize(C2_td)
C2_out = C2_td + Resize(C1_out) + C2_in
C3_out = C3_td + Resize(C2_out) + C3_in
C4_out = C4_td + Resize(C3_out) + C4_in
C5_out = C5_td + Resize(C4_out) + C5_in
C6_out = C6_td + Resize(C5_out) + C6_in
C1_out to C6_out are the multi-scale fused feature layers output after C1 to C6 pass through the IBFPN.
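A compact sketch of the IBFPN fusion defined by the two groups of formulas above is given below. The 1×1 dimension-reduction convolutions, the bilinear resizing and the fixed fused channel width of 256 are assumptions made for illustration; the RFA module is passed in as a separate component and is assumed to output feature maps with the fused channel width.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IBFPN(nn.Module):
    # Sketch of the improved bidirectional FPN: a top-down path whose top level is
    # strengthened by RFA, a bottom-up path, and a skip connection from the initial
    # input Ci_in to the output of the same level.
    def __init__(self, in_channels_list, out_channels=256, rfa=None):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels_list])
        self.rfa = rfa  # residual feature enhancement module for the top level

    def _resize(self, x, ref):
        return F.interpolate(x, size=ref.shape[-2:], mode="bilinear",
                             align_corners=False)

    def forward(self, feats):                       # feats = [C1, ..., C6]
        c_in = [lat(f) for lat, f in zip(self.lateral, feats)]       # Ci_in
        # top-down: C6_td = Conv(C6) + RFA(C6); Ci_td = Ci_in + Resize(C(i+1)_td)
        td = [None] * len(feats)
        td[-1] = c_in[-1] + (self.rfa(feats[-1]) if self.rfa is not None else 0)
        for i in range(len(feats) - 2, -1, -1):
            td[i] = c_in[i] + self._resize(td[i + 1], c_in[i])
        # bottom-up with same-level skip: Ci_out = Ci_td + Resize(C(i-1)_out) + Ci_in
        out = [None] * len(feats)
        out[0] = td[0]                              # C1_out = C1_td
        for i in range(1, len(feats)):
            out[i] = td[i] + self._resize(out[i - 1], td[i]) + c_in[i]
        return out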
Step 3: target information in infrared images is weak and the feature structure is sparse, so complex background environments easily cause missed detections and false detections. To avoid the influence of irrelevant information in the original SSD detection algorithm, the invention improves the CBAM to obtain a new mixed attention module, ECBAM. After the IBFPN, each fused feature layer is further input into the ECBAM, which assigns different weights to different features, so that the model pays more attention to the target part and to the target regions of interest, improving the expressive power of the network and suppressing the influence of irrelevant information.
Step 3.1: the original Convolutional Block Attention Module (CBAM) mainly comprises a channel attention module and a spatial attention module; for an input feature map, channel attention and spatial attention are applied in turn so as to give different weights to features of different importance. However, during the channel attention operation the CBAM processes the results of maximum pooling and average pooling with a parameter-shared fully connected layer, which means the model cannot consider the mappings of the two kinds of features well at the same time; on the other hand, the fully connected layer takes the correlation of all global features into account, which makes the model complexity and the amount of computation too high. To address this problem, the invention provides an improved convolutional attention module based on the idea of local cross-channel interaction.
Step 3.2: the ECBAM is obtained by improving the CBAM. Its structure is shown in fig. 4 and mainly comprises a channel attention module and a spatial attention module. The channel attention module focuses on the importance of different feature channels, i.e. on "what" to attend to; the spatial attention module focuses on the importance of features at different spatial positions, i.e. on "where" the informative parts are, and is complementary to the channel attention. Assume the input feature map is I ∈ R^(H×W×C); the final processing result is obtained by passing it through the channel attention module M_c and the spatial attention module M_s in turn.
The specific structure of the channel attention module is shown in fig. 5 and comprises an average pooling layer AvgPool and a maximum pooling layer MaxPool, a one-dimensional convolution module and a Sigmoid mapping module. In order to highlight the relationships between channels, the spatial dimensions of the input feature map need to be compressed, and average pooling and maximum pooling characterize the importance of different features from different aspects. The one-dimensional convolution module considers the interaction between each channel and its k neighbouring channels; it reduces the number of parameters to a constant magnitude while bringing a performance gain, and it can consider the mappings of the two different pooled features well. The Sigmoid mapping module maps each feature into (0,1).
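A minimal sketch of this channel attention submodule is given below, assuming a shared one-dimensional convolution over the two pooled channel descriptors and the adaptive kernel size defined in step 3.5 (γ = 2 and b = 1 follow the text; rounding up to the next odd number is an implementation assumption).

import math
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # ECBAM channel attention (sketch): average- and max-pool the spatial dimensions,
    # apply a 1D convolution of adaptive size k across the channel axis, add the two
    # results and map them into (0, 1) with a Sigmoid to weight each channel.
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 else k + 1                   # nearest odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)

    def forward(self, x):                           # x: (B, C, H, W)
        # (B, C, 1, 1) -> (B, 1, C) so the 1D convolution slides over the channels
        avg = self.conv(self.avg_pool(x).squeeze(-1).transpose(-1, -2))
        mx = self.conv(self.max_pool(x).squeeze(-1).transpose(-1, -2))
        w = torch.sigmoid(avg + mx).transpose(-1, -2).unsqueeze(-1)   # (B, C, 1, 1)
        return x * w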
The specific structure of the spatial attention module is shown in fig. 6 and comprises an average pooling layer AvgPool, a maximum pooling layer MaxPool, a stacking module and a Sigmoid mapping module. Average pooling and maximum pooling are first applied along the channel dimension, and the feature maps they generate are stitched together; a convolution is then applied to the stitched feature map to produce the final spatial attention map.
In order to give different weights to different features, each fused feature layer from step 2 is input into the ECBAM respectively. An input fused feature layer (one of the outputs C1_out to C6_out of step 2) is denoted as I ∈ R^(H×W×C), wherein H and W are the height and width of the fused feature layer and C is the number of its channels. Each fused feature layer passes through the channel attention module M_c and the spatial attention module M_s in turn to obtain the final processing result. The calculation of each part can be expressed as:
I' = M_c(I) ⊗ I
I'' = M_s(I') ⊗ I'
wherein ⊗ represents element-wise multiplication, I' is the result after channel attention processing, and I'' is the final result.
Step 3.3: the feature layer I from step 2 is first compressed by average pooling and maximum pooling respectively to obtain two feature maps I_avg ∈ R^(1×1×C) and I_max ∈ R^(1×1×C).
and 3.4, in the original CBAM, the full connection processing of parameter sharing of the CBAM can cause that the model can not well consider the mapping of two characteristics at the same time, and the complexity and the calculation amount of the model are overhigh. Unlike shared fully-connected operations in CBAM, ECBAM considers the interaction of each channel and its adjacent k channels, pair I avg 、I max The operation is performed using a one-dimensional convolution of size k, respectively, which brings performance gains while reducing the parameter quantities to a constant magnitude.
Step 3.5: the two results processed in step 3.4 are added element-wise and each channel feature is mapped into (0,1) through a Sigmoid activation function, giving the weight coefficients M_c(I) of the different channels:
M_c(I) = σ(C1D_k(I_avg) + C1D_k(I_max))
where σ is the Sigmoid activation function and C1D_k is a one-dimensional convolution of size k, the size of k being adaptively determined by the following equation:
k = | log2(C)/γ + b/γ |_odd
wherein C represents the number of channels of the fused feature layer I, |·|_odd represents the odd number nearest to the result, γ = 2, b = 1.
Step 3.6: finally, M_c(I) is multiplied with the input fused feature layer I to obtain the channel attention feature map I' = M_c(I) ⊗ I, as shown in fig. 5. I' is then used as a new input feature map and fed into the spatial attention module, where average pooling and maximum pooling are performed across all channels at each position of I' to obtain I'_avg ∈ R^(H×W×1) and I'_max ∈ R^(H×W×1).
Step 3.7: I'_avg and I'_max are stacked and a standard convolution is then performed.
Step 3.8: the spatial weight M_s(I') of I' is obtained with a Sigmoid activation function; the calculation process can be expressed as:
M_s(I') = σ(f^(7×7)([AvgPool(I'); MaxPool(I')])) = σ(f^(7×7)([I'_avg; I'_max]))
where σ is the Sigmoid activation function and f^(7×7) is a convolution kernel of size 7×7.
The obtained spatial weight M_s(I') is multiplied with the channel attention feature map I' to obtain the final feature map, which can be sent for detection.
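Continuing the sketch given after the channel attention description, the spatial attention step and the full ECBAM composition could look roughly as follows; the 7×7 kernel follows the formula above, while the overall wiring is an illustrative assumption.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # ECBAM spatial attention (sketch): pool across the channel dimension, stack the
    # two maps, apply a 7x7 convolution and a Sigmoid to obtain per-position weights.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                           # x: (B, C, H, W)
        avg = torch.mean(x, dim=1, keepdim=True)    # (B, 1, H, W)
        mx, _ = torch.max(x, dim=1, keepdim=True)   # (B, 1, H, W)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class ECBAM(nn.Module):
    # Channel attention followed by spatial attention: I'' = M_s(I') * I' with
    # I' = M_c(I) * I. ChannelAttention is the sketch class defined earlier.
    def __init__(self, channels):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))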
Step 4: the DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 feature maps processed by the IBFPN and ECBAM are sent to the detection module for detection, and the category and corresponding bounding box of each candidate box are obtained as the prediction result.
Step 5: non-maximum suppression is performed on the prediction result, and redundant target boxes are deleted to obtain the final detection result.
Step 6: each module is verified experimentally, and the whole detection algorithm is verified and analysed.
Step 7, IBFPN module verification: based on the MobileNet feature extraction network, comparison experiments are designed using the original feature pyramid levels, the traditional FPN structure, the bidirectional FPN structure and the IBFPN designed by the invention for feature fusion; the detection results are shown in Table 1.
Table 1 Feature fusion module experiment comparison results
As can be seen from Table 1, compared with not using a feature fusion module, the traditional FPN improves the detection accuracy for small targets such as "person" and "bicycle" to a certain extent, but the overall effect is not obvious, with the mAP (mean average precision) increasing by 0.63%; the bidirectional FPN not only improves the detection of small targets but also improves the detection precision of the large target "car", raising the mAP from the original 49.61% to 50.83%; the IBFPN module designed by the invention clearly enhances the detection of both large and small targets, and the overall mAP is improved by 2.03%.
Step 8, ECBAM module verification
In order to verify the effectiveness of the attention module ECBAM designed by the invention for the SSD detection algorithm, the experiment adds different attention modules to each feature branch, specifically: feeding the branch directly into the detection module without additional processing, adding the channel attention module SE after the branch, adding the mixed attention module CBAM, and adding the ECBAM module improved and designed by the invention. The detection results are shown in Table 2.
TABLE 2 attention model experiment comparison results table
Attention module    Not used    SENet     CBAM      ECBAM
mAP (%)             49.61%      50.17%    50.62%    51.04%
As can be seen from Table 2, adding an attention module improves the target detection effect to a certain extent, and the mixed attention modules perform better than the single channel attention module: the mAP of the CBAM is 0.45% higher than that of SENet. The ECBAM replaces the shared fully connected layer in the CBAM with a one-dimensional convolution, which reduces the complexity and improves the accuracy, raising the mAP by a further 0.42%; compared with not using an attention module, the mAP of the ECBAM is 1.43% higher, showing that the attention module can effectively suppress the influence of the irrelevant background, reducing the false detection rate and increasing the detection accuracy for the target.
Step 9, comparative analysis of the whole detection algorithm
In order to fully illustrate the effectiveness of the algorithm of the invention, the SSD_BIFPN_ECBAM algorithm designed by the invention is compared and analysed against various detection algorithms on the FLIR dataset, including SSD algorithms using different feature extraction networks (SSD_VGG16 and SSD_MobileNet) and the DSSD algorithm, which performs a different feature fusion; the detection results are shown in Table 3.
Table 3 Experimental comparison results of detection algorithms
Network             Backbone     mAP (%)    FPS
SSD                 VGG16        49.98%     16.7
SSD                 MobileNet    47.87%     20.5
DSSD                ResNet       52.84%     15.7
SSD_BIFPN_ECBAM     MobileNet    53.02%     17.6
As can be seen from Table 3, the target detection method based on feature fusion and an attention mechanism designed by the invention surpasses the other algorithms in precision. Compared with the original SSD_VGG16 algorithm, the detection precision and speed of the algorithm are improved by 3.04% and 0.9 FPS respectively; compared with the SSD_MobileNet algorithm, the detection precision of the algorithm is improved by 5.15%; the precision of the DSSD algorithm is close to that of the algorithm, but its detection speed is lower.
Sample results are shown in fig. 7 and fig. 8. Fig. 7 contains cars and pedestrians, and it can be seen that the algorithm successfully detects the targets in the frame; on the right side of the figure only a small part of the front of a passing car is visible, but the algorithm still detects it successfully. The difficulty of the scene in fig. 8 is that the cars are occluded by each other and by the surrounding environment, yet the algorithm still successfully detects all targets in both of these complex scenes, showing a good detection effect.

Claims (7)

1. A target detection method based on feature fusion and attention mechanism is characterized by comprising the following steps:
step 1, inputting the infrared image into a MobileNet network to perform layer-by-layer convolution calculation to obtain feature maps of different scales, wherein the 6 feature maps DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 are the feature pyramid images used for detection;
step 2, establishing a bidirectional feature fusion module IBFPN, and inputting the feature pyramid image for detection into the IBFPN to perform mutual fusion of upper and lower layer information;
step 3, after the IBFPN, inputting each fused feature layer into an attention module ECBAM, and assigning different weights to different features through the ECBAM;
step 4, sending the DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 feature maps processed by the IBFPN and ECBAM into a detection module for detection, and obtaining the category and corresponding bounding box of each candidate box to obtain a prediction result;
and 5, performing non-maximum suppression on the prediction result, and deleting redundant target boxes to obtain a final detection result.
2. The target detection method based on feature fusion and attention mechanism as claimed in claim 1, wherein in step 2, the IBFPN is based on the bidirectional pyramid network, constructs a residual feature enhancement module RFA to enhance the top-level features, introduces a bottom-up fusion path, and adds a connection from the initial input to the output for the same level.
3. The method for detecting an object based on feature fusion and attention mechanism according to claim 2, wherein in the step 2, the step of fusing the upper and lower layer information with each other is as follows:
step 2.1, in the forward propagation process, the pyramid feature levels generated by the traditional SSD target detection model are {C1, C2, C3, C4, C5, C6}, corresponding respectively to DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 in step 1;
step 2.2, inputting the C6-layer features into the RFA to improve the feature representation, and obtaining an output feature layer R6 with multi-scale context information, namely RFA(C6);
step 2.3, adding and fusing R6 and the C6 subjected to 1×1 convolution dimensionality reduction to obtain the top-layer feature C6_td;
step 2.4, performing 1×1 convolution dimensionality reduction on C1 to C5 to obtain features C5_in to C1_in;
step 2.5, fusing C1_in to C6_in from top to bottom by using the following formulas:
C6_td = Conv(C6) + RFA(C6)
C5_td = Conv(C5) + Resize(C6_td)
C4_td = Conv(C4) + Resize(C5_td)
C3_td = Conv(C3) + Resize(C4_td)
C2_td = Conv(C2) + Resize(C3_td)
C1_td = Conv(C1) + Resize(C2_td)
wherein Conv is a 1×1 convolution operation for reducing the dimension of the feature channels, Conv(C6) to Conv(C1) are C6_in to C1_in, RFA(·) is the residual feature enhancement performed on a feature map, Resize is the upsampling operation adopted to match the resolutions of different layers, and "+" represents element-wise addition at corresponding positions; C1_td to C6_td are the features obtained after addition and fusion;
step 2.6, fusing the low-level features into the high levels from bottom to top, so that each level has not only strong high-level semantic information but also strong low-level detail and localization information, and adding a connection from the initial input to the output of the same level so that more features are fused without increasing the number of parameters, to obtain the fused feature layers C1_out to C6_out, wherein the fusion formulas are:
C1_out = C1_td = C1_in + Resize(C2_td)
C2_out = C2_td + Resize(C1_out) + C2_in
C3_out = C3_td + Resize(C2_out) + C3_in
C4_out = C4_td + Resize(C3_out) + C4_in
C5_out = C5_td + Resize(C4_out) + C5_in
C6_out = C6_td + Resize(C5_out) + C6_in
C1_out to C6_out are the multi-scale fused feature layers output after C1 to C6 pass through the IBFPN.
4. The target detection method based on feature fusion and attention mechanism as claimed in claim 3, wherein the RFA introduces the idea of residual feature augmentation from AugFPN into the Res residual structure and incorporates the adaptive spatial fusion module, and the RFA adds context information through residual features to reduce the loss of the highest-level features and improve the performance of the pyramid.
5. The feature fusion and attention mechanism based target detection method of claim 3, wherein the upsampling operation is a bilinear interpolation.
6. The target detection method based on feature fusion and attention mechanism according to any one of claims 1 to 5, wherein the attention module ECBAM comprises a channel attention module and a spatial attention module, wherein the channel attention module focuses on the importance of different feature channels and comprises an average pooling layer AvgPool and a maximum pooling layer MaxPool, a one-dimensional convolution module and a Sigmoid mapping module, and the spatial attention module focuses on the importance of features at different spatial positions, i.e. on "where" the informative parts are, is complementary to the channel attention module, and comprises an average pooling layer AvgPool and a maximum pooling layer MaxPool, a stacking module and a Sigmoid mapping module.
7. The target detection method based on feature fusion and attention mechanism as claimed in claim 6, wherein in step 3, different weights are assigned to different features as follows:
step 3.1, inputting each fused feature layer from step 2 into the ECBAM respectively, an input fused feature layer being denoted as I ∈ R^(H×W×C), wherein H and W are the height and width of the fused feature layer and C is the number of its channels; I is first compressed by average pooling and maximum pooling respectively to obtain two feature maps I_avg ∈ R^(1×1×C) and I_max ∈ R^(1×1×C);
step 3.2, the ECBAM considers the interaction between each channel and its k neighbouring channels, and applies a one-dimensional convolution of size k to I_avg and I_max respectively;
step 3.3, adding the two results of step 3.2 element-wise and mapping each channel feature into (0,1) through a Sigmoid activation function to obtain the weight coefficients M_c(I) of the different channels, calculated as:
M_c(I) = σ(C1D_k(I_avg) + C1D_k(I_max))
wherein σ is the Sigmoid activation function and C1D_k is a one-dimensional convolution of size k, the size of k being adaptively determined by:
k = | log2(C)/γ + b/γ |_odd
wherein |·|_odd represents the odd number nearest to the result, γ = 2, b = 1;
step 3.4, multiplying M_c(I) with the fused feature layer I to obtain the channel attention feature map I', i.e. I' = M_c(I) ⊗ I, wherein ⊗ denotes element-wise multiplication;
step 3.5, inputting I' as a new input feature map into the spatial attention module, and performing average pooling and maximum pooling across all channels at each position of I' to obtain I'_avg ∈ R^(H×W×1) and I'_max ∈ R^(H×W×1);
step 3.6, stacking I'_avg and I'_max and then performing a standard convolution;
step 3.7, obtaining the spatial weight M_s(I') of I' by using a Sigmoid activation function, the calculation process being:
M_s(I') = σ(f^(7×7)([AvgPool(I'); MaxPool(I')])) = σ(f^(7×7)([I'_avg; I'_max]))
wherein f^(7×7) is a convolution kernel of size 7×7;
step 3.8, multiplying the obtained spatial weight M_s(I') with the channel attention feature map I' to obtain the final feature map that is sent for detection, the feature maps at this point being the DWS11, DWS13, DWS14_2, DWS15_2, DWS16_2 and DWS17_2 feature maps after IBFPN and ECBAM processing.
CN202210998016.0A 2022-08-19 2022-08-19 Target detection method based on feature fusion and attention mechanism Pending CN115424104A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210998016.0A CN115424104A (en) 2022-08-19 2022-08-19 Target detection method based on feature fusion and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210998016.0A CN115424104A (en) 2022-08-19 2022-08-19 Target detection method based on feature fusion and attention mechanism

Publications (1)

Publication Number Publication Date
CN115424104A true CN115424104A (en) 2022-12-02

Family

ID=84197543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210998016.0A Pending CN115424104A (en) 2022-08-19 2022-08-19 Target detection method based on feature fusion and attention mechanism

Country Status (1)

Country Link
CN (1) CN115424104A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631422A (en) * 2022-12-23 2023-01-20 国家海洋局东海信息中心 Enteromorpha recognition method based on attention mechanism
CN115631422B (en) * 2022-12-23 2023-04-28 国家海洋局东海信息中心 Enteromorpha identification method based on attention mechanism
CN116310709A (en) * 2023-02-03 2023-06-23 江苏科技大学 Lightweight infrared target detection method based on improved PF-YOLO
CN115861772A (en) * 2023-02-22 2023-03-28 杭州电子科技大学 Multi-scale single-stage target detection method based on RetinaNet
CN116664462A (en) * 2023-05-19 2023-08-29 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN116664462B (en) * 2023-05-19 2024-01-19 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN116895050A (en) * 2023-09-11 2023-10-17 四川高速公路建设开发集团有限公司 Tunnel fire disaster identification method and device
CN116895050B (en) * 2023-09-11 2023-12-08 四川高速公路建设开发集团有限公司 Tunnel fire disaster identification method and device

Similar Documents

Publication Publication Date Title
CN115424104A (en) Target detection method based on feature fusion and attention mechanism
CN111325751B (en) CT image segmentation system based on attention convolution neural network
CN112149504B (en) Motion video identification method combining mixed convolution residual network and attention
CN108985317B (en) Image classification method based on separable convolution and attention mechanism
CN114120019A (en) Lightweight target detection method
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
CN110175986B (en) Stereo image visual saliency detection method based on convolutional neural network
CN112541409B (en) Attention-integrated residual network expression recognition method
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN111325165B (en) Urban remote sensing image scene classification method considering spatial relationship information
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN111612008A (en) Image segmentation method based on convolution network
CN115690522B (en) Target detection method based on multi-pooling fusion channel attention and application thereof
CN111583285A (en) Liver image semantic segmentation method based on edge attention strategy
CN113239825B (en) High-precision tobacco beetle detection method in complex scene
CN112634146A (en) Multi-channel CNN medical CT image denoising method based on multiple attention mechanisms
CN112966747A (en) Improved vehicle detection method based on anchor-frame-free detection network
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN115797635A (en) Multi-stage instance segmentation method and system based on parallel feature completion
CN114821341A (en) Remote sensing small target detection method based on double attention of FPN and PAN network
Kan et al. A GAN-based input-size flexibility model for single image dehazing
CN114998756A (en) Yolov 5-based remote sensing image detection method and device and storage medium
CN112967296B (en) Point cloud dynamic region graph convolution method, classification method and segmentation method
CN114049491A (en) Fingerprint segmentation model training method, fingerprint segmentation device, fingerprint segmentation equipment and fingerprint segmentation medium
CN112132207A (en) Target detection neural network construction method based on multi-branch feature mapping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination