CN115439706A - Multi-receptive-field attention mechanism and system based on object detection - Google Patents


Info

Publication number
CN115439706A
Authority
CN
China
Prior art keywords: channel; operator; module; attention mechanism; receptive field
Prior art date
Legal status: Pending (an assumption, not a legal conclusion)
Application number
CN202210523305.5A
Other languages
Chinese (zh)
Inventor
王改华
甘鑫
曹清程
翟乾宇
王能元
Current Assignee
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date
Filing date
Publication date
Application filed by Hubei University of Technology
Priority: CN202210523305.5A
Publication: CN115439706A
Status: Pending

Classifications

    • G — Physics
    • G06 — Computing; calculating or counting
    • G06V — Image or video recognition or understanding
    • G06V 10/00 — Arrangements for image or video recognition or understanding
    • G06V 10/70 — using pattern recognition or machine learning
    • G06V 10/77 — Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 — Fusion of extracted features
    • G06V 10/82 — using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-receptive-field attention mechanism and system based on object detection. The attention mechanism acquires feature maps with different receptive fields to obtain rich context information. Each feature map is split into two groups along the channel dimension, the weights of each group are extracted by one-dimensional convolutions with different kernel sizes, and the five groups of weights are then fused and multiplied with the input feature vector, realizing channel attention with local cross-channel interaction and no dimensionality reduction. Experiments show that the proposed module improves the detection accuracy of different networks, and visualization results show that the proposed networks detect well in different scenes.

Description

Multi-receptive-field attention mechanism and system based on object detection

Technical Field

The invention belongs to the field of deep-learning-based object detection, and specifically relates to a multi-receptive-field attention mechanism and system based on object detection.

Background

With the development of computer science and improvements in hardware, deep-learning-based object detection has been widely applied. Because convolutional neural networks (CNNs) can replace manual feature engineering, they are commonly used to extract image features.

The richness of the semantic features a network extracts directly affects the accuracy of an object detection algorithm. Attention mechanisms are mainly used to strengthen a network's focus on important channel features or on regions of interest in an image, and they are often used in deep learning algorithms to enrich semantic information.

The attention mechanism originated from the study of human vision and was first applied in natural language processing to allocate information-processing resources efficiently. In recent years, attention mechanisms have developed rapidly in computer vision. In 2017, Dongcai Cheng et al. proposed SENet, which realizes channel attention through squeeze, excitation, and scale stages; SENet extracts feature weights directly with two fully connected layers, but it does not capture inter-channel dependencies well. CBAM builds on SENet by adding a parallel max-pooling branch in the squeeze stage, making the gathered information more comprehensive, but it still uses two fully connected layers to extract weights.

To improve a network's feature extraction, attention mechanisms usually enlarge the receptive field to enhance global information, or extract weights that strengthen or suppress channel features. This document proposes a multi-receptive-field attention mechanism. It uses four parallel branches with different receptive fields, which enlarges the receptive field while preventing the loss of detail caused by an overly large receptive field. Each branch is split into two groups, and one-dimensional convolutions of different sizes are used to obtain channel weights, which effectively prevents the loss of channel dependencies caused by channel dimensionality reduction.

Summary of the Invention

To overcome the above shortcomings, the present invention proposes a multi-receptive-field attention mechanism and system based on object detection, which enlarges the receptive field while effectively preventing the loss of channel dependencies caused by channel dimensionality reduction.

The multi-receptive-field attention mechanism based on object detection designed by the present invention comprises the following steps:

Step 1: convolve the input image (of shape batch × channels × height × width) with four mutually different convolution kernels to obtain four tensors [x1, x2, x3, x4] with different receptive fields, and add all of them to obtain x5.

Step 2: split each of the above five tensors into two groups along the channel dimension, use two C modules with different kernel sizes to obtain the channel weights of each group, and then concatenate the two groups along the channel dimension to obtain each tensor's weights.

The C module is expressed by the following formula:

F_C-module-k*k(X_CH) = F_un F_sg W_1d F_s F_a X_CH

where X_CH is the input feature map, F_a is an adaptive average pooling operator, F_sg is a sigmoid operator, W_1d is a 1-D convolutional layer with kernel size k, F_s is a squeeze-and-permute operator, F_un is an unsqueeze-and-permute operator, and k*k denotes the kernel used for local cross-channel interaction. The output formula for weight5 is:

X_5_1, X_5_2 = F_sp(X_5)

weight5 = concat(F_C-module-3*3(X_5_1), F_C-module-5*5(X_5_2))

where F_sp is the grouping operator and concat is the concatenation operator.

Step 3: finally, fuse all channel weights, multiply the weights with the corresponding input vectors, and obtain the output after channel shuffling; the channel-shuffle operator integrates channels without adding computation.

The output of the multi-receptive-field attention mechanism is given by Equation 3 (the filed equation is an image; the form below is reconstructed from the textual description, with F_fuse denoting the weight-fusion step):

X_out = F_cs(F_fuse(weight1, ..., weight5) ⊙ X)    (3)

where F_cs is the channel-shuffle operator and ⊙ is the multiplication operator.

Further, in step 1 the four different convolution kernel sizes are each chosen arbitrarily from 1-9.

Further, the four different convolutions are 1*1, 3*3, 5*5, and 7*7, respectively.

Further, in step 2 the kernel sizes are [3, 5], respectively.

Further, X_CH is the input feature map of size [B, C, H, W], where B, C, H, and W denote the batch size, number of channels, spatial height, and width, respectively. The C module obtains X_a, X_a ∈ R^(B,C,1,1), through a global average pooling operation; to keep the model simple, X_a is squeezed and permuted to obtain X_s, X_s ∈ R^(B,1,C); a convolution kernel of size k then realizes local cross-channel interaction, yielding X_c, X_c ∈ R^(B,1,C); X_sg, X_sg ∈ R^(B,1,C), is obtained through the sigmoid activation function; finally, X_sg is unsqueezed and permuted to obtain the weight X_weight, X_weight ∈ R^(B,C,1,1).

The channel-shuffle operator integrates channels without adding computation: X, X ∈ R^(B,C,H,W), is expanded into X_cs, X_cs ∈ R^(B,G,C//G,H,W); X_cs is then transposed to obtain X_sc, X_sc ∈ R^(B,C//G,G,H,W); finally it is restored to X, X ∈ R^(B,C,H,W), realizing global information interaction.

Based on the same inventive concept, the present invention also designs a system for implementing the object-detection-based multi-receptive-field attention mechanism, characterized as follows:

The system comprises a backbone network, a neck module, and a head module.

The backbone network adopts a ResNet-50 classification network, which extracts features from the input image and outputs four feature vectors with different semantic features.

The neck module convolves the four input feature vectors (of shape batch × channels × height × width) with four mutually different convolution kernels to obtain four tensors [x1, x2, x3, x4] with different receptive fields, and adds all of them to obtain x5.

It splits each of the above five tensors into two groups along the channel dimension, uses two C modules with different kernel sizes to obtain the channel weights of each group, and then concatenates the two groups along the channel dimension to obtain each tensor's weights.

The C module is expressed by the following formula:

F_C-module-k*k(X_CH) = F_un F_sg W_1d F_s F_a X_CH

where X_CH is the input feature map, F_a is an adaptive average pooling operator, F_sg is a sigmoid operator, W_1d is a 1-D convolutional layer with kernel size k, F_s is a squeeze-and-permute operator, F_un is an unsqueeze-and-permute operator, and k*k denotes the kernel used for local cross-channel interaction. The output formula for weight5 is:

X_5_1, X_5_2 = F_sp(X_5)

weight5 = concat(F_C-module-3*3(X_5_1), F_C-module-5*5(X_5_2))

where F_sp is the grouping operator and concat is the concatenation operator.

Finally, all channel weights are fused and multiplied with the corresponding input vectors, and the output is obtained after channel shuffling; the channel-shuffle operator integrates channels without adding computation.

The output of the multi-receptive-field attention mechanism is given by Equation 3 (the filed equation is an image; the form below is reconstructed from the textual description, with F_fuse denoting the weight-fusion step):

X_out = F_cs(F_fuse(weight1, ..., weight5) ⊙ X)    (3)

where F_cs is the channel-shuffle operator and ⊙ is the multiplication operator.

The head module receives the feature maps from the neck and predicts all objects; it computes the loss by dividing samples into positives and negatives and is optimized through back-propagation, realizing classification and regression for objects of different sizes.

The advantages of the present invention are:

Convolutions with different kernel sizes enlarge the receptive field of the feature vectors, so the extracted features fuse richer semantic information.

The C module realizes local cross-channel interaction, reducing the information loss caused by channel dimensionality reduction while also simplifying the model.

Extracting weights in groups better measures the feature importance of different channels.

Weight fusion balances the weight distribution.

Channel shuffling lets channels fuse fully without adding computation.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the structure of the multi-receptive-field attention mechanism;

Figure 2 is a schematic diagram of the structure of the C module;

Figure 3 is a schematic diagram of the overall network structure;

Figure 4 shows the detection results of different networks.

Detailed Description

The technical solutions of the present invention are further described below with reference to the accompanying drawings and embodiments.

The structure of the multi-receptive-field attention mechanism designed by the present invention is shown in Figure 1. X denotes the input feature map, of size [B, C, H, W], where B, C, H, and W denote the batch size, number of channels, spatial height, and width, respectively. First, X is convolved with 1*1, 3*3, 5*5, and 7*7 kernels to obtain four tensors [x1, x2, x3, x4] with different receptive fields, each of size [B, C, H, W]; these are then added to obtain x5.

Each tensor is split evenly into two groups along the channel dimension, each of size [B, C//2, H, W]. Two C modules with different kernel sizes, [3, 5], are used to obtain the channel weights of each group, and the two groups are then concatenated along the channel dimension to obtain each tensor's weights. Different kernel sizes capture cross-channel interaction information differently, so using two groups combines their effects.

The structure of the C module is shown in Figure 2. X_CH is the input feature map, of size [B, C, H, W], where B, C, H, and W denote the batch size, number of channels, spatial height, and width, respectively. The module obtains X_a, X_a ∈ R^(B,C,1,1), through a global average pooling operation. To keep the model simple, X_a is squeezed and permuted to obtain X_s, X_s ∈ R^(B,1,C). A convolution kernel of size k then realizes local cross-channel interaction, yielding X_c, X_c ∈ R^(B,1,C). X_sg, X_sg ∈ R^(B,1,C), is obtained through the sigmoid activation function. Finally, X_sg is unsqueezed and permuted to obtain X_weight, X_weight ∈ R^(B,C,1,1).
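The shape pipeline just described can be sketched in NumPy (a minimal shape-level illustration, not the authors' implementation; the 1-D kernel here is a fixed averaging filter standing in for learned convolution weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def c_module(x, k=3):
    """Shape-level sketch of the C module: F_un(F_sg(W_1d(F_s(F_a(x))))).

    x: feature map of shape (B, C, H, W); returns weights of shape (B, C, 1, 1).
    """
    B, C, H, W = x.shape
    # F_a: global average pooling -> (B, C, 1, 1)
    xa = x.mean(axis=(2, 3), keepdims=True)
    # F_s: squeeze and permute -> (B, 1, C)
    xs = xa.reshape(B, 1, C)
    # W_1d: size-k 1-D convolution along the channel axis ("same" padding);
    # an averaging kernel stands in for the learned weights here.
    kernel = np.ones(k) / k
    xc = np.stack([np.convolve(xs[b, 0], kernel, mode="same") for b in range(B)])
    # F_sg: sigmoid gate -> values in (0, 1)
    xsg = sigmoid(xc)
    # F_un: unsqueeze and permute back -> (B, C, 1, 1)
    return xsg.reshape(B, C, 1, 1)

x = np.random.rand(2, 8, 4, 4)
w = c_module(x, k=3)
print(w.shape)   # (2, 8, 1, 1)
```

Because the convolution acts on the length-C channel axis rather than on an H×W grid, the number of parameters is only k, which is the module's argument for avoiding the two fully connected layers of SENet-style attention.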

The C module is expressed by Equation 1:

F_C-module-k*k(X_CH) = F_un F_sg W_1d F_s F_a X_CH    (1)

where F_a is an adaptive average pooling operator, F_sg is a sigmoid operator, W_1d is a 1-D convolutional layer with kernel size k, F_s is a squeeze-and-permute operator, and F_un is an unsqueeze-and-permute operator. The output weight5 is expressed by Equation 2.

X_5_1, X_5_2 = F_sp(X_5);  weight5 = concat(F_C-module-3*3(X_5_1), F_C-module-5*5(X_5_2))    (2)

where F_sp is the grouping operator and concat is the concatenation operator. The fifth weight is the average of the first four weights; introducing it balances the weight distribution to a certain extent.
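The grouping and concatenation in Equation 2 can be sketched as follows; `toy_weights` is a hypothetical stand-in for the two C modules (a sigmoid of per-channel means), since the real modules use learned 1-D convolutions with k = 3 and k = 5:

```python
import numpy as np

def toy_weights(x):
    # Hypothetical stand-in for a C module: sigmoid of per-channel means.
    m = x.mean(axis=(2, 3), keepdims=True)
    return 1.0 / (1.0 + np.exp(-m))

def group_weights(x5, weight_fn_k3, weight_fn_k5):
    """Sketch of Equation 2: F_sp splits x5 into two channel groups, each
    group's weights are extracted with a different kernel size, and the
    results are concatenated back along the channel axis."""
    B, C, H, W = x5.shape
    x5_1, x5_2 = x5[:, : C // 2], x5[:, C // 2:]          # F_sp
    return np.concatenate([weight_fn_k3(x5_1), weight_fn_k5(x5_2)], axis=1)

x5 = np.random.rand(2, 8, 4, 4)
weight5 = group_weights(x5, toy_weights, toy_weights)
print(weight5.shape)   # (2, 8, 1, 1)
```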

Finally, all channel weights are fused and the result is multiplied with X; the output is obtained after channel shuffling.

The output of the multi-receptive-field attention mechanism is expressed by Equation 3.

X_out = F_cs(F_fuse(weight1, ..., weight5) ⊙ X)    (3)

where F_cs is the channel-shuffle operator, ⊙ is the multiplication operator, and F_fuse denotes the weight-fusion step described above (the filed equation is an image, so this form is reconstructed from the description).

The channel-shuffle operator integrates channels without adding computation: X, X ∈ R^(B,C,H,W), is expanded into X_cs, X_cs ∈ R^(B,G,C//G,H,W); X_cs is then transposed to obtain X_sc, X_sc ∈ R^(B,C//G,G,H,W); finally it is restored to X, X ∈ R^(B,C,H,W), realizing global information interaction.
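The reshape–transpose–reshape sequence above can be written directly in NumPy (a minimal illustration of the operator; the array contents are only re-indexed, never recomputed):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Channel-shuffle operator F_cs: reshape (B, C, H, W) into
    (B, G, C//G, H, W), swap the two group axes, and flatten back.
    A pure re-indexing, so it adds essentially no computation."""
    B, C, H, W = x.shape
    assert C % groups == 0
    x_cs = x.reshape(B, groups, C // groups, H, W)
    x_sc = x_cs.transpose(0, 2, 1, 3, 4)
    return x_sc.reshape(B, C, H, W)

x = np.arange(16, dtype=float).reshape(2, 8, 1, 1)
y = channel_shuffle(x, groups=2)
print(y[0, :, 0, 0])   # [0. 4. 1. 5. 2. 6. 3. 7.]
```

Shuffling with G groups followed by shuffling with C//G groups restores the original order, confirming the operation is an invertible permutation of channels.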

Weights are usually extracted with global average pooling alone, followed by two fully connected layers with a non-linearity and a sigmoid function to generate the channel weights. The two fully connected layers are designed to capture non-linear cross-channel interactions and include dimensionality reduction to control model complexity. Although this strategy is widely used in later channel attention modules, dimensionality reduction has side effects on channel attention prediction, and capturing dependencies between all channels is inefficient and unnecessary. The proposed module avoids dimensionality reduction, effectively captures cross-channel interaction information, and also reduces the number of parameters.

The present invention is an optimized neural network based on the MFPN network; its steps include:

Step 1: data input and optimization strategy.

The PASCAL VOC experiments use PASCAL VOC 2007 and PASCAL VOC 2012, with 21 categories in total, 16,551 training images, and 16,492 test images.

The MS COCO 2017 dataset has 80 categories and 118,287 images. It covers the most common objects in daily life and is a rich object detection dataset.

All images are cropped to 512*512 for training. The SGD optimizer is used with a learning rate of 0.001, momentum of 0.9, and weight decay of 0.0001. The learning rate follows a step schedule, and training runs for 12 epochs.
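Since the experiments use MMDetection, these settings can be written as an MMDetection-style config fragment; the field names and the step milestones [8, 11] are illustrative assumptions (the text only states a step schedule over 12 epochs), not the authors' actual configuration:

```python
# Hypothetical MMDetection-style fragment reflecting the stated settings;
# the step milestones [8, 11] are assumed (a common choice for a 12-epoch
# schedule), since the text only says "step schedule".
crop_size = (512, 512)
optimizer = dict(type="SGD", lr=0.001, momentum=0.9, weight_decay=0.0001)
lr_config = dict(policy="step", step=[8, 11])
runner = dict(type="EpochBasedRunner", max_epochs=12)

print(optimizer["lr"], runner["max_epochs"])   # 0.001 12
```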

Step 2: model construction.

The network of the present invention, shown in Figure 3, consists of three parts: a backbone network, a neck module, and a head module. The backbone adopts ResNet-50 to extract image features; it outputs four feature maps of different sizes [C2, C3, C4, C5], with strides [4, 8, 16, 32] and channel counts [256, 512, 1024, 2048]. The neck module connects the backbone and the heads and fuses features. It takes the three backbone feature maps [C3, C4, C5], reduces all of their channels to 256 with 1*1 convolutions, and processes all of them with the multi-receptive-field attention mechanism; FPN-style feature fusion is then applied, the fused features are processed again with the multi-receptive-field attention mechanism, and finally 3*3 convolutions are applied to the feature maps to suppress the aliasing effect of fusion. The neck outputs five feature maps of different sizes [P3, P4, P5, P6, P7], with strides [8, 16, 32, 64, 128] and 256 channels each. The head module performs object detection, realizing classification and regression of targets.
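The stride arithmetic above implies the following spatial sizes (a bookkeeping sketch; level names are from the text, and the 512×512 input is the training crop used in the experiments):

```python
# Bookkeeping for the feature pyramid: backbone maps [C2..C5] at strides
# [4, 8, 16, 32] and neck outputs [P3..P7] at strides [8, 16, 32, 64, 128].
def level_sizes(input_size, strides):
    return [(input_size // s, input_size // s) for s in strides]

backbone = dict(zip(["C2", "C3", "C4", "C5"], level_sizes(512, [4, 8, 16, 32])))
neck = dict(zip(["P3", "P4", "P5", "P6", "P7"],
                level_sizes(512, [8, 16, 32, 64, 128])))

print(backbone["C5"], neck["P7"])   # (16, 16) (4, 4)
```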

Step 3: training and testing.

The evaluation metrics of the experiments are Average Precision (AP), with AP50, AP75, AP_S, AP_M, and AP_L as the main criteria.

Hardware: CPU: Intel Xeon E5-2683 v3 @ 2.00 GHz; RAM: 32 GB; graphics card: Nvidia GTX 1080 Ti; hard disk: 500 GB.

Software: MMDetection 2.6; PyTorch 1.6.0; torchvision 0.7.0; CUDA 10.0; cuDNN 7.4.

The present invention tests the effect of the multi-receptive-field attention mechanism on detection accuracy and conducts comparative experiments on multiple networks; the experimental results are shown in Table 1.

Table 1. Effect of the multi-receptive-field attention mechanism on the MS COCO 2017 dataset across different networks; × means without the attention mechanism, √ means with the attention mechanism.

(Table 1 is filed as an image in the original document; its key results are summarized in the following paragraph.)

The MFPN structure improves all four networks to varying degrees. The AP of ATSS increases from 32.7% to 33.7%, and its AP50 and AP_L even increase by 2%. The AP of FCOS improves by 0.9% from 29.1%, with other metrics also improving by about 1%. The AP of VFNet increases by only 0.4% from 34.1%, but its AP_L rises from 50.5% to 53.0%. The improvement is most pronounced on FoveaBox, whose AP increases by 2.6% and whose AP_L rises from 43.8% to 47.3%.

Feature maps with different receptive fields affect the detection of objects of different sizes differently. The multi-receptive-field attention structure integrates feature maps of four different receptive fields, which effectively balances objects of different sizes. Its extraction of per-channel weights also strengthens important features and reduces redundancy. The experimental results show that the mechanism's use of feature maps with different receptive fields is effective.

Figure 4 shows the detection results of different networks. Faster R-CNN shows clear redundancy in its detections: the first, second, and third images contain many redundant objects, and the second image does not accurately localize the car. FCOS shows obvious missed detections: the television and the person on the right are not detected in the first image, and the laptop is missed in the third. FoveaBox has redundant detections in the first and third images and mistakes the cat in the fourth image for a bear; ATSS also mistakes the fourth image for a bear. The network proposed by the invention not only detects targets accurately but also has low false-detection and redundancy rates, and its detection accuracy is better than that of these networks.

The specific embodiments described herein merely illustrate the spirit of the present invention. Those skilled in the art may make various modifications or supplements to the described embodiments, or substitute similar approaches, without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.

Claims (6)

1.一种基于目标检测的多感受野注意力机制,其特征在于,包括如下步骤:1. a multi-receptive field attention mechanism based on target detection, is characterized in that, comprises the steps: 步骤1,分别以四个互不相同的卷积核对输入图像的批处理、通道数、空间高度和宽度进行卷积,得到四个不同感受野的张量[x1,x2,x3,x4],并将所有张量相加得到x5;Step 1. Convolute the batch, channel number, spatial height and width of the input image with four different convolution kernels to obtain four tensors [x1,x2,x3,x4] with different receptive fields. and sum all tensors to get x5; 步骤2,将以上5个张量分别在通道维度上分为两组,使用两个不同卷积核大小的C模块来获得每组的通道权重,然后将通道维度中的两个组连接起来,以获得每个张量的权重;Step 2, divide the above 5 tensors into two groups in the channel dimension, use two C modules with different convolution kernel sizes to obtain the channel weight of each group, and then connect the two groups in the channel dimension, to get the weight of each tensor; 其中,C模块由以下公式表示:Among them, the C module is represented by the following formula: FC module k*k(XCH)=FunFsgW1dFsFaXCH F C module k*k (X CH )=F un F sg W 1d F s F a X CH 其中,XCH为输入特征图,Fa是一个自适应平均池化算子,Fsg是一个sigmoid算子,W1d是一个k×k卷积层,Fs是一个压缩和交换算子,Fun是一个解压缩和交换算子,k*k为局部跨通道交互中使用的卷积核;权重5的输出公式为:Among them, X CH is the input feature map, F a is an adaptive average pooling operator, F sg is a sigmoid operator, W 1d is a k×k convolutional layer, F s is a compression and exchange operator, F un is a decompression and exchange operator, k*k is the convolution kernel used in local cross-channel interaction; the output formula of weight 5 is: X5_1,X5_2=FspX5 X 5_1 , X 5_2 = F sp X 5 weight5=concat(FC module 3*3(X5_1),FC module 5*5(X5_2))weight5=concat(F C module 3*3 (X 5_1 ),F C module 5*5 (X 5_2 )) 其中Fsp是组运算算子,concat是拼接算子;Among them, F sp is a group operation operator, and concat is a splicing operator; 步骤3,最后,融合所有通道权重,然后将权重乘以对应的输入向量;并在通道打乱后得到输出;通道打乱算子是在不增加计算量的情况下整合通道;Step 3, finally, fuse all channel weights, and then multiply the weights by the corresponding input vector; and get the output after the channels are shuffled; the channel shuffling operator integrates the channels 
without increasing the amount of calculation; 多感受野注意力机制的输出公式3为:The output formula 3 of the multi-receptive field attention mechanism is:
Figure RE-FDA0003888244250000011
Figure RE-FDA0003888244250000011
其中Fcs是通道打乱算子,⊙是乘法算子。Among them, F cs is the channel scrambling operator, and ⊙ is the multiplication operator.
2.根据权利要求1所述的基于目标检测的多感受野注意力机制,其特征在于:步骤1中四个不同的卷积核在数字1-9中任意选取。2. The multi-receptive field attention mechanism based on target detection according to claim 1, characterized in that: in step 1, four different convolution kernels are arbitrarily selected from numbers 1-9. 3.根据权利要求2所述的基于目标检测的多感受野注意力机制,其特征在于:所述四个不同的卷积分别为1*1、3*3、5*5、7*7。3. The multi-receptive field attention mechanism based on target detection according to claim 2, wherein the four different convolutions are 1*1, 3*3, 5*5, and 7*7 respectively. 4.根据权利要求1所述的基于目标检测的多感受野注意力机制,其特征在于:步骤2中卷积核大小分别为[3,5]。4. The multi-receptive field attention mechanism based on target detection according to claim 1, characterized in that: the sizes of the convolution kernels in step 2 are [3, 5]. 5.根据权利要求1所述的基于目标检测的多感受野注意力机制,其特征在于:XCH为输入特征图,其大小为[B,C,H,W],其中B、C、H、W分别表示批处理、通道数、空间高度和宽度;C模块它通过全局平均池化操作获得Xa,Xa∈R(B,C,1,1),为了避免模型过于复杂,对Xa进行挤压置换,然后得到Xs,Xs∈R(B,1,C);之后,使用k*k的卷积核实现局部跨通道交互得到Xc,Xc∈R(B ,1,C);Xsg,Xsg∈R(B,1,C)通过sigmoid激活函数获得;最后,解压和置换Xsg,然后得权重Xweight,Xweight∈R(B,C,1,1)5. 
The multi-receptive field attention mechanism based on target detection according to claim 1, characterized in that: X CH is an input feature map, and its size is [B, C, H, W], where B, C, H , W represent batch processing, channel number, space height and width respectively; C module obtains X a , X a ∈ R (B,C,1,1) through global average pooling operation, in order to avoid the model is too complicated, for X a is squeezed and replaced, and then X s , X s ∈ R (B,1,C) is obtained; after that, the k*k convolution kernel is used to realize local cross-channel interaction to obtain X c , X c ∈ R (B ,1 ,C) ; X sg , X sg ∈ R (B,1,C) is obtained through the sigmoid activation function; finally, decompress and replace X sg , and then get the weight X weight , X weight ∈ R (B,C,1,1 ) ; 通道打乱算子是在不增加计算量的情况下整合通道;X,X∈R(B,C,H,W)扩展成Xcs,Xcs,∈R(B,G,C//G,H,W),再改变Xcs得到Xsc,Xsc∈R(B,C//G,G,H,W);最后还原成X,X∈R(B,C,H,W),实现全局信息交互。The channel scrambling operator is to integrate channels without increasing the amount of calculation; X,X∈R (B,C,H,W) is expanded into X cs ,X cs ,∈R (B,G,C//G ,H,W) , then change X cs to get X sc ,X sc ∈R (B,C//G,G,H,W) ; finally restore to X,X∈R (B,C,H,W) , to achieve global information exchange. 6.一种用于实现权利要求1-5任一所述基于目标检测的多感受野注意力机制的网络,其特征在于:6. 
6. A network for implementing the multi-receptive field attention mechanism based on target detection according to any one of claims 1-5, characterized in that:

it comprises a backbone network, a neck module, and a head module;

the backbone network convolves the input image (over its batch, channel, spatial-height, and spatial-width dimensions) with four mutually different convolution kernels to obtain four tensors [x1, x2, x3, x4] with different receptive fields, and adds all the tensors to obtain x5;

the neck module divides each of the above five tensors into two groups along the channel dimension, uses two C modules with different convolution kernel sizes to obtain the channel weights of each group, and then concatenates the two groups along the channel dimension to obtain the weight of each tensor;

the C module is expressed by the following formula:

F_C-module-k*k(X_CH) = F_un F_sg W_1d F_s F_a X_CH

where X_CH is the input feature map, F_a is an adaptive average pooling operator, F_sg is a sigmoid operator, W_1d is a k×k convolutional layer, F_s is a squeeze-and-permute operator, F_un is an unsqueeze-and-permute operator, and k*k is the convolution kernel used for local cross-channel interaction; the output formula for weight5 is:

X_5_1, X_5_2 = F_sp X_5

weight5 = concat(F_C-module-3*3(X_5_1), F_C-module-5*5(X_5_2))

where F_sp is the grouping operator and concat is the concatenation operator;

the head module fuses all channel weights, multiplies each weight by the corresponding input tensor, and obtains the output after channel shuffling; the channel shuffle operator integrates channels without increasing the amount of computation; the output formula (formula 3) of the multi-receptive field attention mechanism is:
X_output = F_cs(Σ_{i=1..5} weight_i ⊙ x_i)
where F_cs is the channel shuffle operator and ⊙ is the multiplication operator.
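Under the same assumptions, a minimal end-to-end sketch of the backbone/neck/head pipeline of claim 6, including the channel shuffle operator F_cs, might look like the following; the function names, the group count G=4, and the use of freshly constructed, untrained convolutions are all illustrative, not the patent's implementation:

```python
# Hypothetical end-to-end sketch of the multi-receptive-field attention mechanism
# of claim 6. Convolutions are built on the fly with random weights, which is only
# adequate for a shape/flow sketch, not for training.
import torch
import torch.nn as nn

def c_module(x: torch.Tensor, k: int) -> torch.Tensor:
    """F_un . F_sg . W_1d . F_s . F_a applied to [B,C,H,W]; returns [B,C,1,1] weights."""
    conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
    xa = x.mean(dim=(2, 3), keepdim=True)              # F_a: global average pool
    xs = xa.squeeze(-1).transpose(1, 2)                # F_s: [B,1,C]
    xsg = torch.sigmoid(conv(xs))                      # W_1d then F_sg
    return xsg.transpose(1, 2).unsqueeze(-1)           # F_un: [B,C,1,1]

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """F_cs: [B,C,H,W] -> [B,G,C//G,H,W] -> [B,C//G,G,H,W] -> [B,C,H,W]."""
    b, c, h, w = x.shape
    return (x.view(b, groups, c // groups, h, w)
             .transpose(1, 2).reshape(b, c, h, w))

def tensor_weight(x: torch.Tensor) -> torch.Tensor:
    """Neck: split channels into two groups (F_sp), C modules with k=3 and k=5, concat."""
    x1, x2 = x.chunk(2, dim=1)
    return torch.cat([c_module(x1, 3), c_module(x2, 5)], dim=1)

def mrf_attention(x: torch.Tensor) -> torch.Tensor:
    # Backbone: four receptive fields via 1*1, 3*3, 5*5, 7*7 convs (claim 3), plus their sum x5.
    convs = [nn.Conv2d(x.shape[1], x.shape[1], k, padding=k // 2) for k in (1, 3, 5, 7)]
    xs = [conv(x) for conv in convs]
    xs.append(sum(xs))                                 # x5
    # Head: fuse weight_i * x_i over all five tensors, then channel shuffle (formula 3).
    fused = sum(tensor_weight(t) * t for t in xs)
    return channel_shuffle(fused, groups=4)

out = mrf_attention(torch.randn(2, 8, 16, 16))
print(out.shape)  # torch.Size([2, 8, 16, 16])
```

Note that the view/transpose/reshape sequence in `channel_shuffle` moves no data beyond one copy, which is why the claim can describe it as integrating channels "without increasing the amount of computation".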
CN202210523305.5A 2022-05-13 2022-05-13 A multi-receptive field attention mechanism and system based on target detection Pending CN115439706A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210523305.5A CN115439706A (en) 2022-05-13 2022-05-13 A multi-receptive field attention mechanism and system based on target detection

Publications (1)

Publication Number Publication Date
CN115439706A true CN115439706A (en) 2022-12-06

Family

ID=84241637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210523305.5A Pending CN115439706A (en) 2022-05-13 2022-05-13 A multi-receptive field attention mechanism and system based on target detection

Country Status (1)

Country Link
CN (1) CN115439706A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402747A (en) * 2023-02-24 2023-07-07 上海白春学人工智能科技工作室 Multi-receptive-field attention lung nodule benign and malignant classification and identification system and method
CN117764988A (en) * 2024-02-22 2024-03-26 山东省计算中心(国家超级计算济南中心) Road crack detection method and system based on heteronuclear convolution multi-receptive field network
CN117764988B (en) * 2024-02-22 2024-04-30 山东省计算中心(国家超级计算济南中心) Road crack detection method and system based on heteronuclear convolution multi-receptive field network

Similar Documents

Publication Publication Date Title
CN110188239B (en) A dual-stream video classification method and device based on cross-modal attention mechanism
Yu et al. Bisenet: Bilateral segmentation network for real-time semantic segmentation
CN110032926B (en) A deep learning-based video classification method and device
Wu et al. Shift: A zero flop, zero parameter alternative to spatial convolutions
WO2023185243A1 (en) Expression recognition method based on attention-modulated contextual spatial information
CN112801040B (en) Lightweight unconstrained facial expression recognition method and system embedded with high-order information
CN115082698B (en) A distracted driving behavior detection method based on multi-scale attention module
CN110020639B (en) Video feature extraction method and related equipment
CN115439706A (en) A multi-receptive field attention mechanism and system based on target detection
CN114333074B (en) Human body posture estimation method based on dynamic lightweight high-resolution network
CN110032925A (en) A kind of images of gestures segmentation and recognition methods based on improvement capsule network and algorithm
CN108446589A (en) Face identification method based on low-rank decomposition and auxiliary dictionary under complex environment
CN115116054B (en) A method for identifying pests and diseases based on multi-scale lightweight networks
CN111582095A (en) A lightweight method for fast detection of abnormal pedestrian behavior
CN116189281B (en) End-to-end human behavior classification method and system based on spatiotemporal adaptive fusion
CN116168197A (en) An Image Segmentation Method Based on Transformer Segmentation Network and Regularization Training
Harkat et al. Fire detection using residual deeplabv3+ model
CN114882267A (en) Small sample image classification method and system based on relevant region
CN115171052B (en) Crowded crowd attitude estimation method based on high-resolution context network
CN116543433A (en) A mask wearing detection method and device based on the improved YOLOv7 model
Wang et al. A Convolutional Neural Network Pruning Method Based On Attention Mechanism.
CN113361336A (en) Method for positioning and identifying pedestrian view attribute in video monitoring scene based on attention mechanism
CN117542118A (en) Unmanned aerial vehicle aerial video action recognition method based on space-time information dynamic modeling
Khan et al. Binarized convolutional neural networks for efficient inference on GPUs
Rao et al. Non-local attentive temporal network for video-based person re-identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination