CN115439706A - A multi-receptive field attention mechanism and system based on target detection - Google Patents
- Publication number: CN115439706A (application CN202210523305.5A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Description
Technical Field

The invention belongs to the field of deep-learning-based object detection, and specifically relates to a multi-receptive-field attention mechanism and system for object detection.
Background Art

With the development of computer science and improvements in hardware, deep-learning-based object detection has been widely applied. Because convolutional neural networks (CNNs) can replace manual feature engineering, they are commonly used to extract image features.

How richly a network extracts semantic features directly affects the accuracy of an object detection algorithm. Attention mechanisms are mainly used to strengthen the network's focus on important channel features, or to sharpen its attention on regions of interest in an image; they are therefore widely used in deep learning algorithms to enrich semantic information.

The attention mechanism originated from the study of human vision and was first applied in natural language processing to allocate information-processing resources efficiently. In recent years it has developed rapidly in computer vision. In 2017, Hu et al. proposed SENet, which realizes channel attention through squeeze, excitation, and scale stages. SENet extracts feature weights directly with two fully connected layers, but this does not capture inter-channel dependencies well. CBAM builds on SENet by adding a parallel max-pooling branch in the squeeze stage, which makes the gathered information more comprehensive, but it still uses two fully connected layers to extract the weights.

To improve feature extraction, attention mechanisms typically enlarge the receptive field to strengthen global information, or extract weights that enhance or suppress channel features. This invention proposes a multi-receptive-field attention mechanism. It uses four parallel branches with different receptive fields, which enlarges the receptive field while preventing the loss of detail caused by an overly large one. Each branch is split into two groups, and one-dimensional convolutions of different sizes obtain the channel weights, which effectively prevents the loss of channel dependencies caused by channel-dimension reduction.
Summary of the Invention

To overcome the above shortcomings, the present invention proposes a multi-receptive-field attention mechanism and system for object detection, which both enlarges the receptive field and effectively prevents the loss of channel dependencies caused by channel-dimension reduction.

The multi-receptive-field attention mechanism designed by the present invention is characterized by the following steps:
Step 1: convolve the input feature map, of size [batch, channels, height, width], with four mutually different convolution kernels to obtain four tensors [x1, x2, x3, x4] with different receptive fields, and add all of them to obtain x5;
Step 2: split each of the above five tensors into two groups along the channel dimension, use two C modules with different convolution kernel sizes to obtain the channel weight of each group, and then concatenate the two groups along the channel dimension to obtain the weight of each tensor;

the C module is expressed by the following formula:

F_C module k*k(X_CH) = F_un(F_sg(W_1d(F_s(F_a(X_CH)))))

where X_CH is the input feature map, F_a is an adaptive average pooling operator, F_sg is a sigmoid operator, W_1d is a k×k convolutional layer, F_s is a squeeze-and-transpose operator, F_un is an unsqueeze-and-transpose operator, and k*k is the convolution kernel used for local cross-channel interaction; the output weight5 is given by:

X5_1, X5_2 = F_sp(X5)
weight5 = concat(F_C module 3*3(X5_1), F_C module 5*5(X5_2))

where F_sp is the grouping operator and concat is the concatenation operator;
Step 3: finally, fuse all channel weights, multiply the fused weight by the corresponding input, and obtain the output after channel shuffling; the channel shuffle operator integrates channels without increasing the amount of computation;

the output of the multi-receptive-field attention mechanism is given by Equation 3:

X_out = F_cs((weight1 + weight2 + weight3 + weight4 + weight5) ⊙ X)   (3)

where F_cs is the channel shuffle operator and ⊙ denotes element-wise multiplication.
Further, in Step 1 each of the four different convolution kernel sizes may be chosen arbitrarily from 1 to 9.

Further, the four convolutions are 1*1, 3*3, 5*5, and 7*7 respectively.

Further, the convolution kernel sizes in Step 2 are [3, 5].
Further, X_CH is the input feature map of size [B, C, H, W], where B, C, H, and W denote the batch size, number of channels, spatial height, and width respectively. The C module obtains X_a ∈ R^(B,C,1,1) through a global average pooling operation; to avoid making the model too complex, X_a is squeezed and transposed to obtain X_s ∈ R^(B,1,C); a k*k convolution kernel then realizes local cross-channel interaction, giving X_c ∈ R^(B,1,C); X_sg ∈ R^(B,1,C) is obtained through the sigmoid activation function; finally, X_sg is unsqueezed and transposed to give the weight X_weight ∈ R^(B,C,1,1).

The channel shuffle operator integrates channels without increasing the amount of computation: X ∈ R^(B,C,H,W) is expanded into X_cs ∈ R^(B,G,C//G,H,W); X_cs is then transposed to obtain X_sc ∈ R^(B,C//G,G,H,W); finally the result is restored to X ∈ R^(B,C,H,W), realizing global information interaction.
Based on the same inventive concept, the present invention also provides a system for implementing the above multi-receptive-field attention mechanism for object detection, characterized in that:

it comprises a backbone network, a neck module, and a head module;

the backbone network adopts the resnet50 classification network, extracts features from the input image, and outputs four feature vectors with different semantic features;
the neck module convolves the four input feature vectors, each of size [batch, channels, height, width], with four mutually different convolution kernels to obtain four tensors [x1, x2, x3, x4] with different receptive fields, and adds all of them to obtain x5;

each of the above five tensors is split into two groups along the channel dimension; two C modules with different convolution kernel sizes obtain the channel weight of each group, and the two groups are then concatenated along the channel dimension to obtain the weight of each tensor;

the C module is expressed by the following formula:

F_C module k*k(X_CH) = F_un(F_sg(W_1d(F_s(F_a(X_CH)))))

where X_CH is the input feature map, F_a is an adaptive average pooling operator, F_sg is a sigmoid operator, W_1d is a k×k convolutional layer, F_s is a squeeze-and-transpose operator, F_un is an unsqueeze-and-transpose operator, and k*k is the convolution kernel used for local cross-channel interaction; the output weight5 is given by:

X5_1, X5_2 = F_sp(X5)
weight5 = concat(F_C module 3*3(X5_1), F_C module 5*5(X5_2))

where F_sp is the grouping operator and concat is the concatenation operator;

finally, all channel weights are fused and the fused weight is multiplied by the corresponding input vector; the output is obtained after channel shuffling, where the channel shuffle operator integrates channels without increasing the amount of computation;

the output of the multi-receptive-field attention mechanism is given by Equation 3:

X_out = F_cs((weight1 + weight2 + weight3 + weight4 + weight5) ⊙ X)   (3)

where F_cs is the channel shuffle operator and ⊙ denotes element-wise multiplication;
the head module receives the feature maps output by the neck, predicts all objects, computes the loss by dividing samples into positives and negatives, and, after back-propagation optimization, realizes the classification and regression of objects of different sizes.
The advantages of the present invention are:

Convolutions with different kernel sizes enlarge the receptive field of the feature vectors, so that the extracted features fuse richer semantic information.

The C module realizes local cross-channel interaction, reducing the information loss caused by channel-dimension reduction while also simplifying the model.

Extracting weights in groups better measures the importance of the features in different channels.

Weight fusion balances the weight distribution.

Channel shuffling fully mixes the channels without increasing the amount of computation.
Brief Description of the Drawings

Figure 1 is a schematic diagram of the structure of the multi-receptive-field attention mechanism;

Figure 2 is a schematic diagram of the structure of the C module;

Figure 3 is a schematic diagram of the overall network structure;

Figure 4 shows the detection results of different networks.
Detailed Description

The technical solution of the present invention is further described below in conjunction with the accompanying drawings and embodiments.
The structure of the multi-receptive-field attention mechanism designed by the present invention is shown in Figure 1. X denotes the input feature map of size [B, C, H, W], where B, C, H, and W denote the batch size, number of channels, spatial height, and width respectively. First, X is convolved with 1*1, 3*3, 5*5, and 7*7 kernels to obtain four tensors [x1, x2, x3, x4] with different receptive fields, each of size [B, C, H, W]; these are then added to obtain x5.
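As an illustration of this step, the sketch below builds the four parallel branches in NumPy. The k*k kernels are fixed box filters applied per channel, a placeholder for the learned convolutions in the actual network, so only the shapes and the branch summation are meaningful here.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same'-padded 2D convolution of a [B, C, H, W] tensor with one
    k*k kernel shared across channels (a simplified stand-in for the learned
    convolutions of the invention)."""
    b, c, h, w = x.shape
    kern = np.full((k, k), 1.0 / (k * k))  # placeholder box-filter weights
    p = k // 2
    xp = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            patch = xp[:, :, i:i + k, j:j + k]
            out[:, :, i, j] = np.tensordot(patch, kern, axes=([2, 3], [0, 1]))
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 8, 16, 16))               # [B, C, H, W]
branches = [conv2d_same(X, k) for k in (1, 3, 5, 7)]  # x1..x4
x5 = sum(branches)                                    # element-wise sum
print([b.shape for b in branches], x5.shape)
```

All four branches keep the [B, C, H, W] shape, so they can be summed directly; with k = 1 the box filter reduces to the identity, matching the 1*1 branch's role of preserving fine detail.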
Each tensor is split evenly into two groups along the channel dimension, each of size [B, C//2, H, W]. Two C modules with different kernel sizes, [3, 5], obtain the channel weights of each group. The two groups are then concatenated along the channel dimension to obtain the weight of each tensor. Different convolution kernels capture cross-channel interaction information differently, so using two groups combines their effects.
The structure of the C module is shown in Figure 2. X_CH is the input feature map of size [B, C, H, W], where B, C, H, and W denote the batch size, number of channels, spatial height, and width respectively. The module obtains X_a ∈ R^(B,C,1,1) through a global average pooling operation. To avoid making the model too complex, X_a is squeezed and transposed to obtain X_s ∈ R^(B,1,C). A k*k convolution kernel then realizes local cross-channel interaction, giving X_c ∈ R^(B,1,C). X_sg ∈ R^(B,1,C) is obtained through the sigmoid activation function. Finally, X_sg is unsqueezed and transposed to give X_weight ∈ R^(B,C,1,1).
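The C module's operator chain (F_a → F_s → W_1d → F_sg → F_un) can be sketched in NumPy as follows. The 1-D convolution weights are a fixed placeholder (in the network they are learned), so the sketch only demonstrates the shape flow [B,C,H,W] → [B,C] → [B,1,C] → [B,1,C] → [B,C,1,1] and the (0, 1) range of the resulting weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def c_module(x, k=3):
    """Sketch of the C module on a [B, C, H, W] input: global average pooling,
    squeeze to [B, 1, C], a k-tap 1-D convolution across channels (local
    cross-channel interaction), sigmoid, and unsqueeze back to [B, C, 1, 1].
    The 1-D kernel is a fixed placeholder for the learned weights."""
    b, c, h, w = x.shape
    xa = x.mean(axis=(2, 3))             # F_a: adaptive avg pool -> [B, C]
    xs = xa[:, None, :]                  # F_s: squeeze/transpose -> [B, 1, C]
    kern = np.full(k, 1.0 / k)           # placeholder 1-D conv weights
    p = k // 2
    xp = np.pad(xs, ((0, 0), (0, 0), (p, p)), mode="edge")
    xc = np.stack([(xp[:, 0, i:i + k] * kern).sum(axis=1) for i in range(c)],
                  axis=1)                # W_1d: local cross-channel mix -> [B, C]
    xsg = sigmoid(xc)                    # F_sg: sigmoid gate
    return xsg[:, :, None, None]         # F_un: unsqueeze -> [B, C, 1, 1]

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 8, 5, 5))
wgt = c_module(X, k=3)
print(wgt.shape)  # (2, 8, 1, 1)
```

Because the k-tap convolution only mixes each channel with its neighbors, the module captures local cross-channel dependencies without the dimensionality reduction of two fully connected layers.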
The C module is expressed by Equation 1:

F_C module k*k(X_CH) = F_un(F_sg(W_1d(F_s(F_a(X_CH)))))   (1)

where F_a is an adaptive average pooling operator, F_sg is a sigmoid operator, W_1d is a k×k convolutional layer, F_s is a squeeze-and-transpose operator, and F_un is an unsqueeze-and-transpose operator. The output weight5 is expressed by Equation 2:

X5_1, X5_2 = F_sp(X5)
weight5 = concat(F_C module 3*3(X5_1), F_C module 5*5(X5_2))   (2)

where F_sp is the grouping operator and concat is the concatenation operator. The fifth weight is in effect an average over the first four branches, and introducing it balances the weight distribution to a certain extent.
Finally, all channel weights are fused and the fused weight is multiplied by X; the output is obtained after channel shuffling. The output of the multi-receptive-field attention mechanism is expressed by Equation 3:

X_out = F_cs((weight1 + weight2 + weight3 + weight4 + weight5) ⊙ X)   (3)

where F_cs is the channel shuffle operator and ⊙ denotes element-wise multiplication.
The channel shuffle operator integrates channels without increasing the amount of computation. X ∈ R^(B,C,H,W) is expanded into X_cs ∈ R^(B,G,C//G,H,W); X_cs is then transposed to obtain X_sc ∈ R^(B,C//G,G,H,W); finally the result is restored to X ∈ R^(B,C,H,W), realizing global information interaction.
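The expand → transpose → restore sequence above is a pure index permutation, which can be sketched in NumPy as follows; shuffling again with the group count swapped (G exchanged for C//G) inverts the permutation, a convenient sanity check.

```python
import numpy as np

def channel_shuffle(x, g):
    """Channel shuffle: reshape [B, C, H, W] -> [B, G, C//G, H, W], swap the
    two group axes, and flatten back. No arithmetic is performed, so the
    operation adds no computation beyond index bookkeeping."""
    b, c, h, w = x.shape
    assert c % g == 0, "channel count must be divisible by the group count"
    return (x.reshape(b, g, c // g, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(b, c, h, w))

rng = np.random.default_rng(2)
X = rng.standard_normal((1, 6, 2, 2))
Y = channel_shuffle(X, g=2)
# Shuffling again with the swapped group count inverts the permutation.
Z = channel_shuffle(Y, g=3)
print(np.allclose(Z, X))  # True
```

For C = 6 and G = 2, the channel order [0, 1, 2, 3, 4, 5] becomes [0, 3, 1, 4, 2, 5], interleaving the two groups so that later layers see information from both.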
Channel weights are usually extracted with global average pooling alone, followed by two fully connected layers with non-linearity and a sigmoid function that generates the channel weights. The two fully connected layers are designed to capture non-linear cross-channel interactions, and include a dimensionality reduction to control model complexity. Although this strategy has been widely used in subsequent channel attention modules, the dimensionality reduction has side effects on channel attention prediction, and capturing the dependencies between all channels is both inefficient and unnecessary. The proposed module avoids dimensionality reduction, effectively captures cross-channel interaction information, and also reduces the parameter count.
The present invention is a neural network optimized on the basis of the MFPN network, and its steps include:

Step 1: data input and optimization strategy.
The PASCAL VOC dataset combines PASCAL VOC 2007 and PASCAL VOC 2012, with a total of 21 categories, 16551 training images, and 16492 test images.

The MS COCO 2017 dataset has 80 categories and 118,287 images. It covers the most common objects in everyday life and is a rich object detection dataset.

All images are cropped to 512*512 for training. The SGD optimizer is used with a learning rate of 0.001, momentum of 0.9, and weight decay of 0.0001. The learning rate follows a step-decay schedule over a training period of 12 epochs.
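The hyper-parameters above can be collected into a configuration sketch. The text only states a step policy over 12 epochs, so the decay epochs [8, 11] and the decay factor 0.1 below are assumptions borrowed from MMDetection's standard 12-epoch ("1x") schedule, not values given in the description.

```python
# Hypothetical training configuration mirroring the stated hyper-parameters.
# decay_epochs and gamma are assumptions (MMDetection's default 1x schedule);
# the text only says "step-decay over 12 epochs".
optimizer = {"type": "SGD", "lr": 0.001, "momentum": 0.9, "weight_decay": 0.0001}
total_epochs = 12
decay_epochs = [8, 11]  # assumed step points
gamma = 0.1             # assumed decay factor

def lr_at(epoch):
    """Step-decay learning rate for a given (0-indexed) epoch."""
    lr = optimizer["lr"]
    for e in decay_epochs:
        if epoch >= e:
            lr *= gamma
    return lr

print([round(lr_at(e), 6) for e in range(total_epochs)])
```

Under these assumptions the rate stays at 0.001 for epochs 0-7, drops to 0.0001 at epoch 8, and to 0.00001 at epoch 11.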
Step 2: model construction.

The network of the present invention, shown in Figure 3, consists of three parts: a backbone network, a neck module, and a head module. The backbone adopts resnet50 to extract image features and outputs four feature maps of different sizes [C2, C3, C4, C5], with strides [4, 8, 16, 32] and channel counts [256, 512, 1024, 2048]. The neck module connects the backbone and the heads and fuses features. It takes the three backbone feature maps [C3, C4, C5], reduces each to 256 channels with a 1*1 convolution, processes all of them with the multi-receptive-field attention mechanism, fuses features with an FPN structure, processes the fused features with the multi-receptive-field attention mechanism again, and finally applies a 3*3 convolution to smooth the feature maps. It outputs five feature maps of different sizes [P3, P4, P5, P6, P7], with strides [8, 16, 32, 64, 128] and 256 channels each. The head module detects objects and realizes the classification and regression of the targets.
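The strides above fix the spatial sizes of the backbone and neck outputs. For the 512*512 training crops used here they work out as follows; this is a quick arithmetic check, not part of the network code.

```python
# Feature-map sizes implied by the stated strides for a 512*512 input.
input_size = 512
backbone = {f"C{i + 2}": input_size // s for i, s in enumerate([4, 8, 16, 32])}
neck = {f"P{i + 3}": input_size // s for i, s in enumerate([8, 16, 32, 64, 128])}
print(backbone)  # {'C2': 128, 'C3': 64, 'C4': 32, 'C5': 16}
print(neck)      # {'P3': 64, 'P4': 32, 'P5': 16, 'P6': 8, 'P7': 4}
```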
Step 3: training and testing.

The experiments use average precision (AP), together with AP50, AP75, AP_S, AP_M, and AP_L, as the main evaluation metrics.
Hardware: CPU: Intel Xeon E5-2683 V3 @ 2.00 GHz; RAM: 32 GB; graphics card: Nvidia GTX 1080Ti; hard disk: 500 GB.

Software: MMDetection 2.6; PyTorch 1.6.0; Torchvision 0.7.0; CUDA 10.0; CUDNN 7.4.
The present invention tests the effect of the multi-receptive-field attention mechanism on detection accuracy, with comparative experiments on multiple networks; the results are shown in Table 1.

Table 1: effect of the multi-receptive-field attention mechanism on the MS COCO 2017 dataset across different networks, where × indicates no attention mechanism and √ indicates the attention mechanism is used.
The MFPN structure brings improvements of varying degrees on all four networks. The AP of ATSS increases from 32.7% to 33.7%, and its AP50 and AP_L even increase by 2%. The AP of FCOS improves by 0.9% from 29.1%, with the other metrics also improving by about 1%. The AP of VFNet increases by only 0.4% from 34.1%, but its AP_L rises from 50.5% to 53.0%. The improvement is most pronounced on FoveaBox, whose AP increases by 2.6% and whose AP_L rises from 43.8% to 47.3%.
Feature maps with different receptive fields affect the detection of objects of different sizes differently. The multi-receptive-field attention mechanism integrates feature maps of four different receptive fields, which effectively balances objects of different sizes. Moreover, its extraction of per-channel weights enhances important features and reduces redundancy. The experimental results show that using feature maps with different receptive fields in the attention mechanism is effective.
Figure 4 shows the detection results of the different networks. Faster-RCNN shows obvious redundancy in its detections: the first, second, and third images contain many redundant objects, and in the second image the position of the car is not returned accurately. FCOS clearly misses detections: the television and the person on the right are not detected in the first image, and the laptop is missed in the third image. FoveaBox produces redundant detections in the first and third images, and in the fourth image wrongly detects the cat as a bear; ATSS also wrongly detects it as a bear in the fourth image. The network proposed by the present invention not only detects targets accurately but also has low false-detection and redundancy rates, and its detection accuracy is better than that of these other networks.
The specific embodiments described herein merely illustrate the spirit of the present invention. Those skilled in the art may make various modifications or supplements to the described embodiments, or substitute them in similar ways, without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210523305.5A CN115439706A (en) | 2022-05-13 | 2022-05-13 | A multi-receptive field attention mechanism and system based on target detection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115439706A true CN115439706A (en) | 2022-12-06 |
Family
ID=84241637
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210523305.5A Pending CN115439706A (en) | 2022-05-13 | 2022-05-13 | A multi-receptive field attention mechanism and system based on target detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115439706A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116402747A * | 2023-02-24 | 2023-07-07 | 上海白春学人工智能科技工作室 | Multi-receptive-field attention lung nodule benign and malignant classification and identification system and method |
CN117764988A * | 2024-02-22 | 2024-03-26 | 山东省计算中心(国家超级计算济南中心) | Road crack detection method and system based on heteronuclear convolution multi-receptive field network |
CN117764988B * | 2024-02-22 | 2024-04-30 | 山东省计算中心(国家超级计算济南中心) | Road crack detection method and system based on heteronuclear convolution multi-receptive field network |
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination