CN116452937A - Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism - Google Patents

Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism

Info

Publication number
CN116452937A
Authority
CN
China
Prior art keywords
feature
convolution
attention
module
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310454888.5A
Other languages
Chinese (zh)
Inventor
许国良
王钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202310454888.5A priority Critical patent/CN116452937A/en
Publication of CN116452937A publication Critical patent/CN116452937A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-modal feature target detection method based on dynamic convolution and an attention mechanism, and belongs to the field of image recognition. In the method, two data streams are provided at the initial stage of the YOLOv5 Backbone, taking visible light images and infrared images as inputs respectively, and a dynamic convolution module ODConv, a multispectral convolutional attention feature fusion module MS-CBAM and a residual network are used to perform feature extraction. The advantage of the invention is that the features of the visible light image and the infrared image are combined, and by combining multiple attention mechanisms and architectures the detection accuracy for multi-modal and small targets is greatly improved, solving the problem of weak target detection performance in dim environments. Compared with other multi-modal fusion target detection methods, the method trains faster and consumes fewer hardware resources.

Description

Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism
Technical Field
The invention belongs to the field of image recognition, and relates to a multi-modal feature target detection method based on dynamic convolution and an attention mechanism.
Background
Target detection is a very important technology in computer vision tasks, and its performance directly affects the detection accuracy and operating efficiency of related tasks. Accordingly, the field has long received attention from both academia and industry. The target detection discussed in the present invention aims to improve overall network performance by using new modality data and new modality fusion methods. For example, at night a traffic system is likely to face a shortage of light for its surveillance video, making it difficult to photograph most illegal behaviors from a single-spectrum data source or to provide monitoring functions such as automatic alarms for pedestrians, vehicles and traffic accidents. Infrared images captured by an infrared camera can complement the visible light features of objects such as vehicles and pedestrians at night, and can greatly improve detection accuracy for nighttime targets. Therefore, how to use large amounts of multispectral image data to improve the performance of target recognition and detection models is a task of great research value and challenge. A multi-modal feature fusion dual-stream neural network integrates information from the two different modalities into a deep learning network, greatly improving training precision and accuracy on such problems. However, the receptive field of existing CNN convolutions can only fuse information within a local area, a dual-stream convolutional neural network cannot fully exploit the complementarity between modalities, and simply superimposing feature maps increases the learning difficulty of the network and aggravates modal imbalance, so performance degrades. The invention modifies the existing YOLOv5 neural network model and introduces improved channel attention, spatial attention and dynamic convolution to form a modal fusion module, so that the two modalities can be more fully cross-modally fused, learned and predicted under multiple kinds of attention. Meanwhile, the NWD localization loss function is used to enhance small target detection accuracy.
Disclosure of Invention
It is therefore an object of the present invention to provide a method for multi-modal feature object detection based on dynamic convolution and attention mechanisms.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A multi-modal feature target detection method based on dynamic convolution and an attention mechanism comprises the following steps:
S1: establishing a neural network model of a dual-stream convolution detection network based on YOLOv5, wherein the Backbone adopts convolution operations and feature fusion modules to perform modal fusion and feature learning;
S2: constructing a multispectral module MS-CBAM from channel attention and spatial attention, wherein channel attention performs feature weighting on the visible light and infrared feature maps separately, the infrared and visible light feature maps are then stacked together, spatial attention performs feature weighting on the stacked feature map, and a residual network then refines the features;
S3: introducing a multi-head attention mechanism into the convolution structure, and establishing the dynamic convolution module ODConv by assigning different attention coefficient matrices to the convolution along the four dimensions of input channel, output channel, space and convolution kernel;
S4: placing the MS-CBAM module at the position outputting the larger 80×80×256 feature map, and placing the ODConv modules at the positions outputting the medium and small 40×40×512 and 20×20×1024 feature maps; the three feature maps of different sizes enter the Neck layer, i.e. the feature pyramid, for further feature extraction, and predictions are made on the output features to produce the detection result;
S5: in the training stage, the visible light and infrared data enter the dual-stream neural network for training after Mosaic data augmentation, adaptive anchor box computation and adaptive image scaling; the network is initialized with YOLOv5l pre-trained weights, and its parameters are learned with stochastic gradient descent;
in the prediction stage, a softmax classifier is used to obtain the final classification probability of the category to which the target belongs;
in the optimization stage, the error between the ground truth and the prediction is reduced by jointly optimizing the localization loss, classification loss and confidence loss, and NWD is introduced into the localization loss to improve small target detection accuracy; step S5 is repeated until the number of iterations reaches the preset value, at which point model training is complete and the model can perform target detection tasks.
Optionally, in the step S1, the input of the dual-flow convolutional target detection network frame based on YOLO v5 is an image pair of different modes, and the backup is a dual-flow convolutional network, and the dual-flow neural network model includes Backbone, neck and a prediction layer;
let the input visible light feature map be X_V and the input infrared feature map be X_T, where the height, width and channel number of the feature maps are H, W and C respectively;
the feature extraction network uses three feature fusion modules and a residual network to form a three-cycle feature extraction and refinement structure, and the ith feature fusion is computed as:
f^i = σ(F(X_V, X_T))
wherein σ is a feature fusion function, X_V is the visible light input feature map, X_T is the infrared input feature map, and F is the feature fusion module performing the batch normalization operation; the fused feature map f^i has height H, width W and channel number 2C. A residual network is then constructed by combining the fused feature with the original features:
f_v^i = X_V + f^i
f_t^i = X_T + f^i
obtaining the new visible light and infrared feature maps f_v^i and f_t^i.
Optionally, in step S2, channel attention is computed separately on the visible light and infrared input feature maps, the two maps are then stacked along the channel dimension, and the stacked feature map is fed into the spatial attention for further processing;
the computation of the MS-CBAM module is expressed as:
X = M_S[concat[M_C(X_V), M_C(X_T)]]
wherein M_C denotes the channel attention mechanism, M_S denotes the spatial attention mechanism, and concat denotes stacking the feature maps along the channel dimension;
the residual network constructed from X then refines the features, expressed as:
X'_V = X_V + X
X'_T = X_T + X
and the finally obtained feature maps X'_V ∈ V^(B×C×H×W) and X'_T ∈ T^(B×C×H×W) represent the final output of the MS-CBAM module.
Optionally, in step S3, a multi-head self-attention mechanism is introduced into the convolution process, and ODConv assigns different attention coefficient matrices to the convolution along the four dimensions of input channel, output channel, space and convolution kernel, improving the feature extraction capability; the overall operation of the ODConv module is expressed as:
X' = ODConv(concat(X_V, X_T))
wherein X_V and X_T are the input feature maps of the visible light and infrared modalities respectively, concat denotes stacking the two inputs along the channel dimension, and ODConv denotes the dynamic convolution operation;
the dynamic convolution formula integrating the four dimensions is:
y = (α_w1 ⊙ α_f1 ⊙ α_c1 ⊙ α_s1 ⊙ W_1 + ... + α_wn ⊙ α_fn ⊙ α_cn ⊙ α_sn ⊙ W_n) * x
wherein α_wi is the attention coefficient matrix of the convolution kernel W_i, and α_si, α_ci and α_fi are the dynamic convolution attention coefficient matrices along the spatial dimension, input channel dimension and output channel dimension of W_i respectively; ⊙ denotes element-wise multiplication along the corresponding dimension of the kernel space, and * denotes the convolution operation.
Optionally, the input and output of the MS-CBAM module and the ODConv module are both visible light and infrared light characteristic diagrams, and the output and the input form a residual error network;
the loss functions of the positioning loss, the classification loss and the confidence loss are expressed as follows:
L total =L box +L cls +L conf
wherein the positioning loss adopts an NWD loss function; the NWD loss function calculates the similarity by introducing Normalized Wasserstein Distance calculation method through the corresponding gaussian distribution.
The invention has the beneficial effects that it effectively improves target detection when the whole image, or part of it, is insufficiently lit, and offers advantages in prediction accuracy and reliability when applied to a target detection system.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of the overall architecture of the present invention;
FIG. 2 is a flow chart of the Backbone of the present invention;
FIG. 3 is a block diagram of a dynamic convolution feature fusion module;
FIG. 4 is a block diagram of an MS-CBAM feature fusion module.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or carried out in other, different embodiments, and the details of the present description may be modified or varied in various ways without departing from the spirit and scope of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the present invention, and the following embodiments and the features in the embodiments may be combined with each other without conflict.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that, if there are terms such as "upper", "lower", "left", "right", "front", "rear", etc., that indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, it is only for convenience of describing the present invention and simplifying the description, but not for indicating or suggesting that the referred device or element must have a specific azimuth, be constructed and operated in a specific azimuth, so that the terms describing the positional relationship in the drawings are merely for exemplary illustration and should not be construed as limiting the present invention, and that the specific meaning of the above terms may be understood by those of ordinary skill in the art according to the specific circumstances.
The invention provides a target detection method based on dynamic convolution, an attention mechanism and a YOLO v5 dual-stream network, as shown in FIG. 1, comprising the following steps:
step 1: in the step 1, a basic double-flow neural network model is constructed, as shown in fig. 2, YOLO v5 comprises data processing, backbone, neck and a prediction layer, and the invention designs a feature extraction and fusion idea based on dynamic convolution feature fusion module and MS-CBAM multi-mode cyclic fusion and refinement, and the fusion operation is repeated for a plurality of times and residual processing is carried out to increase the consistency of multispectral features.
Step 2: the dynamic convolution ODConv is constructed, which assigns different attention coefficient matrices to the convolution along the four dimensions of input channel, output channel, space and convolution kernel; as shown in FIG. 3, a multi-head attention mechanism and a parallel strategy are used to learn modal-complementary attention over the four dimensions of the kernel space.
Step 3: an MS-CBAM module based on channel attention and spatial attention is constructed, as shown in FIG. 4; the feature map is weighted in the channel dimension and the spatial dimension respectively, and feature refinement is carried out using a residual network.
Step 4: the MS-CBAM module is set to output as a position with a larger characteristic diagram of 80 multiplied by 256, and the ODConv module is set to output as a position with a medium size and a small size of 40 multiplied by 512 and 20 multiplied by 1024. Inputting the feature map into a feature pyramid of YOLO v5, and continuing feature fusion and prediction of YOLO v 5;
Step 5: after the parameters are determined, the neural network is trained on the training samples until the training conditions are met, and the trained network is tested on the test set;
In step 1, the invention predicts with the YOLO v5 Neck and Head layers and establishes a baseline network based on a dual-stream convolutional network, in which the convolutional network extracts the local features of the visible light and infrared modalities separately, and a feature weighting fusion module then performs the feature weighting fusion operation.
First, the visible light and infrared images processed in step 1 each pass through three convolution operations, and the resulting visible light and infrared feature maps are denoted X_V and X_T.
The invention designs a feature fusion operation using a residual network formed with the MS-CBAM module and the ODConv module. As shown in FIG. 2, these modules and the residual network jointly construct the cyclic fusion and refinement scheme for feature extraction and fusion. The feature fusion operation is carried out at the three positions of sizes 80×80×256, 40×40×512 and 20×20×1024 in the YOLO v5 network, that is, the large, medium and small feature maps denoted P3, P4 and P5 in FIG. 2 that enter the feature pyramid. The cyclic fusion and refinement structure of the invention increases the consistency of the multispectral features. In the ith fusion module, to obtain the new fused feature f^i, the visible light image feature X_V and the infrared image feature X_T are combined as:
f^i = σ(F(X_V, X_T))
wherein σ is a feature fusion function and F is the feature fusion module.
To avoid overfitting, the operation F shares weights across all cycles; the fused feature is then combined with the original features to construct a residual network:
f_v^i = X_V + f^i
f_t^i = X_T + f^i
To prevent the vanishing-gradient problem when learning the network parameters and to better perform multispectral feature fusion, an auxiliary semantic segmentation task is used to bring separate information to each refined spectral feature.
The similarity between modalities increases with the number of cycles, and as the similarity between spectral features increases, their consistency increases while their complementarity decreases. Consistency between multispectral features is important, but too much consistency can lead to a sharp rise or fall in the feature values, and superfluous cycles of fusion are meaningless. Experiments show that the feature fusion performance starts to decrease after the fourth cycle, so in practice three cycles are chosen to balance consistency and complementarity.
Meanwhile, the three feature fusion modules feed the three processed feature maps (large, medium and small) into the feature pyramid at their three respective positions, as sketched below.
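The following is a minimal PyTorch-style sketch of the cyclic fusion-and-refinement structure described above; it is an illustrative assumption rather than the patented implementation. The fusion module F (MS-CBAM or ODConv) is assumed to return a map with the same channel count as each modality stream so that the residual addition is shape-compatible, and the batch normalization plays the role of σ:

import torch
import torch.nn as nn

class CyclicFusion(nn.Module):
    """Cyclic fusion and refinement: a shared fusion module F produces a fused map f_i,
    which is added back to each modality stream as a residual (hypothetical sketch)."""
    def __init__(self, channels, fusion_module, num_cycles=3):
        super().__init__()
        self.fusion = fusion_module          # F, shared weights across all cycles
        self.bn = nn.BatchNorm2d(channels)   # sigma: batch normalization of the fused map
        self.num_cycles = num_cycles

    def forward(self, x_v, x_t):
        for _ in range(self.num_cycles):
            f = self.bn(self.fusion(x_v, x_t))   # fused feature f_i, shape (B, C, H, W)
            x_v = x_v + f                        # residual refinement, visible stream
            x_t = x_t + f                        # residual refinement, infrared stream
        return x_v, x_t

Three cycles are used in this sketch, matching the cycle count chosen above.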
Further, in step 2, the dynamic convolution applies multi-head self-attention operations over the dimensions of the convolution kernel. The dynamic convolution layer uses a linear combination of n convolution kernels, dynamically weighted by the attention mechanism, making the convolution operation dependent on the input feature map. The overall operation of ODConv can be expressed as:
X' = ODConv(concat(X_V, X_T))
wherein X_V and X_T are the input feature maps of the visible light and infrared modalities respectively, concat denotes stacking the two inputs along the channel dimension, and ODConv denotes the dynamic convolution operation.
Specifically, in mathematical terms, a dynamic convolution operation in a single dimension can be defined as:
y = (α_w1 W_1 + ... + α_wn W_n) * x
wherein x and y denote the input and output feature map matrices with height h, width w and channel numbers c_in and c_out respectively; W_i denotes the ith convolution kernel composed of the output convolution filters W_i^m, m = 1, ..., c_out; α_wi is the attention scalar of the convolution kernel dimension, computed by an attention function π_wi(x) conditioned on the input features; * denotes the convolution operation, and the bias term is omitted here.
According to the dynamic convolution equation, dynamic convolution has two basic components: the n convolution kernels W_i and the attention function that computes their attention scalars α_wi. The corresponding kernel space has four dimensions: the k×k spatial kernel size, the input channel number c_in and the output channel number c_out of each convolution kernel, and the number of kernels n.
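As an illustration of the kernel-dimension-only case above, the following is a small sketch (an assumption, in PyTorch) of a dynamic convolution that mixes n candidate kernels with input-conditioned attention scalars before applying a single convolution; the class name and the reduction ratio are hypothetical:

import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelDynamicConv(nn.Module):
    """y = (a_1*W_1 + ... + a_n*W_n) * x, with a_i = softmax(pi_w(x)) (illustrative sketch)."""
    def __init__(self, c_in, c_out, k=3, n=4, reduction=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n, c_out, c_in, k, k) * 0.02)  # n candidate kernels
        hidden = max(c_in // reduction, 4)
        self.attn = nn.Sequential(                     # pi_w(x): GAP -> FC -> ReLU -> FC
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(c_in, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, n))
        self.k = k

    def forward(self, x):
        b, c, h, w = x.shape
        alpha = F.softmax(self.attn(x), dim=1)                         # (B, n) kernel attention
        kernels = torch.einsum('bn,noikl->boikl', alpha, self.weight)  # per-sample mixed kernel
        kernels = kernels.reshape(b * kernels.size(1), c, self.k, self.k)
        y = F.conv2d(x.reshape(1, b * c, h, w), kernels,
                     padding=self.k // 2, groups=b)                    # per-sample convolution
        return y.reshape(b, -1, h, w)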
The ODConv module in the invention simultaneously considers the convolution kernel dimension, the spatial dimension, the input channel dimension and the output channel dimension, so that the multi-modal feature fusion within the convolution operation is more comprehensive; the formula for each dimension is analogous to the kernel-dimension dynamic convolution. As shown in FIG. 3, the dynamic convolution formula that integrates the four dimensions can be expressed as:
y = (α_w1 ⊙ α_f1 ⊙ α_c1 ⊙ α_s1 ⊙ W_1 + ... + α_wn ⊙ α_fn ⊙ α_cn ⊙ α_sn ⊙ W_n) * x
wherein α_wi is the attention coefficient matrix of the convolution kernel W_i, and α_si, α_ci and α_fi are the dynamic convolution attention coefficient matrices along the spatial dimension, input channel dimension and output channel dimension of W_i respectively; ⊙ denotes element-wise multiplication along the corresponding dimension of the kernel space.
Here α_si assigns a different attention scalar to each convolution filter at the k×k spatial locations; α_ci assigns different attention values to the c_in channels of each convolution filter W_i^m; α_fi assigns different attention values to the c_out output channels of each convolution kernel W_i; and α_wi assigns an attention scalar to the whole convolution kernel. Multiplying the attention coefficient matrices of these four dimensions with the corresponding dimensions of the n convolution kernels yields the output of the module.
Specifically, the input x is first compressed by a global average pooling operation into a feature vector of length c_in, which then passes through a fully connected layer and a ReLU unit; the fully connected layer maps the compressed feature vector into a low-dimensional space with reduction rate r. Four branches follow, one per dimension, with FC layers of output sizes k×k, c_in×1, c_out×1 and n×1 respectively, each followed by a Softmax or Sigmoid function to generate the normalized attention coefficient matrices α_si, α_ci, α_fi and α_wi.
Since these four dimensions are complementary, they capture rich contextual cues; ODConv can therefore significantly enhance the feature extraction capability of the basic CNN convolution operation.
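The following is a sketch (an assumption, not the patented code) of the four-branch attention head just described: global average pooling compresses the input to a c_in-length vector, an FC+ReLU projects it to a low-dimensional space with reduction rate r, and four FC branches emit the spatial, input-channel, output-channel and kernel-wise attention coefficients; the class name and the normalization choice per branch are hypothetical:

import torch
import torch.nn as nn

class ODAttention(nn.Module):
    """Generates the four ODConv attention coefficient sets alpha_s, alpha_c, alpha_f, alpha_w."""
    def __init__(self, c_in, c_out, k=3, n=4, r=16):
        super().__init__()
        hidden = max(c_in // r, 4)
        self.squeeze = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(c_in, hidden), nn.ReLU(inplace=True))
        self.fc_spatial = nn.Linear(hidden, k * k)   # alpha_s: k x k spatial positions
        self.fc_in = nn.Linear(hidden, c_in)         # alpha_c: input channels
        self.fc_out = nn.Linear(hidden, c_out)       # alpha_f: output channels
        self.fc_kernel = nn.Linear(hidden, n)        # alpha_w: n candidate kernels

    def forward(self, x):
        z = self.squeeze(x)                                   # (B, hidden) compressed descriptor
        a_s = torch.sigmoid(self.fc_spatial(z))               # Sigmoid-normalized coefficients
        a_c = torch.sigmoid(self.fc_in(z))
        a_f = torch.sigmoid(self.fc_out(z))
        a_w = torch.softmax(self.fc_kernel(z), dim=1)         # Softmax over the n kernels
        return a_s, a_c, a_f, a_w

These coefficients would then be broadcast onto the corresponding dimensions of the n convolution kernels before the convolution is applied, as in the formula above.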
Further, in step 3, an MS-CBAM module based on channel attention and spatial attention is established, weighting is performed on the feature map in the channel dimension and the spatial dimension respectively, and feature refinement is performed by using a residual network.
For the input feature maps X_V ∈ V^(B×C×H×W) and X_T ∈ T^(B×C×H×W), where V denotes the visible image, T denotes the infrared image, B denotes the batch size, C denotes the channel number, and H and W denote the height and width of the feature map in pixels, the computation of the MS-CBAM module can be expressed as:
X = M_S[concat[M_C(X_V), M_C(X_T)]]
wherein M_C denotes the channel attention mechanism, M_S denotes the spatial attention mechanism, concat denotes stacking the feature maps along the channel dimension, and X is the module output. Feature weighting in the channel and spatial dimensions is performed by channel attention and spatial attention, which mitigates the adverse effect of relying on a single type of pooling operation and improves the accuracy of the neural network.
The channel attention module (Channel Attention Module, CAM) improves the representational capacity of the feature map by learning the interactions between channels. Specifically, the channel attention module first performs maximum pooling and average pooling over each channel of the input feature map to obtain the max-pooled and average-pooled feature vectors. These are then taken as input, the weight of each channel is obtained through two fully connected layers and a Sigmoid function, and the channel weights are multiplied with the original feature map to obtain the weighted feature map. The channel attention mechanism can be expressed as:
M_C(X) = Sigmoid(MLP(AvgPool(X)) + MLP(MaxPool(X)))
wherein AvgPool and MaxPool denote average pooling and maximum pooling respectively.
The spatial attention module (Spatial Attention Module, SAM) improves the representational capacity of the feature map by learning the interactions between pixels. The input of this module is the feature map output by the channel attention module. The spatial attention module first performs maximum pooling and average pooling on the input feature map to obtain the max-pooled and average-pooled maps, concatenates the two maps, obtains the weight of each pixel through a convolution layer and a Sigmoid function, and multiplies the pixel weights with the original feature map to obtain the weighted feature map. Concretely, average pooling and maximum pooling are applied along the channel dimension of the stacked visible light and infrared feature map, producing two single-channel maps; these are concatenated in the channel dimension, the concatenated map is reduced to one channel by a 7×7 convolution, and a Sigmoid activation then generates the spatial attention features.
And finally, multiplying the output features of the spatial attention with the input features element by element to obtain the finally generated features. The spatial attention mechanism can be expressed as:
in the method, in the process of the invention,and->Mean pooling and maximum pooling are indicated, respectively.
The invention applies channel attention and spatial attention, and then refines the features with the residual network constructed from X; the process can be expressed as:
X'_V = X_V + X
X'_T = X_T + X
The finally obtained feature maps X'_V ∈ V^(B×C×H×W) and X'_T ∈ T^(B×C×H×W) represent the final output of the MS-CBAM module.
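The following is a minimal sketch (a PyTorch assumption, not the patented implementation) of the MS-CBAM fusion described above: per-modality channel attention, channel-wise concatenation, spatial attention, and a residual connection back to each stream. A 1×1 convolution is assumed here to map the 2C-channel fused map back to C channels so that the residual addition is shape-compatible:

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, c, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(inplace=True), nn.Linear(c // r, c))

    def forward(self, x):
        b, c, _, _ = x.shape
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        return x * w.view(b, c, 1, 1)                  # channel-weighted feature map

class SpatialAttention(nn.Module):
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))    # pixel-weighted feature map

class MSCBAM(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.ca_v, self.ca_t = ChannelAttention(c), ChannelAttention(c)
        self.sa = SpatialAttention()
        self.proj = nn.Conv2d(2 * c, c, 1)             # assumed 2C -> C projection for the residual

    def forward(self, x_v, x_t):
        fused = self.sa(torch.cat([self.ca_v(x_v), self.ca_t(x_t)], dim=1))
        fused = self.proj(fused)
        return x_v + fused, x_t + fused                # X'_V, X'_T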
Further, in step 4, the feature maps at the three positions P3, P4 and P5 in FIG. 2, whose sizes H×W×C are 80×80×256, 40×40×512 and 20×20×1024, undergo multi-modal feature fusion with MS-CBAM, ODConv and ODConv respectively, and the three large, medium and small feature maps are then input into the YOLO v5 Neck feature pyramid for further feature fusion and extraction.
In step 5, the loss function is divided into localization loss, classification loss and confidence loss, which can be expressed as:
L_total = L_box + L_cls + L_conf
wherein the localization loss uses NWD and the other losses use the YOLO v5 default loss functions.
the NWD uses a measurement mode based on Wasserstein distance, so that the detection performance of a small target is greatly improved.
For small objects there will always be some background pixels in the bounding box, since a real object is never exactly a rectangle. Within the bounding box, foreground pixels are typically concentrated at the center and background pixels near the edges. To better weight each pixel in the bounding box, the bounding box can be modeled as a 2D Gaussian distribution. Specifically, for a horizontal bounding box R = (cx, cy, w, h), the inscribed ellipse can be expressed as:
(x − μ_x)² / σ_x² + (y − μ_y)² / σ_y² = 1
wherein (μ_x, μ_y) is the center point of the ellipse and (σ_x, σ_y) are the semi-axis lengths along the x and y axes. For the bounding box this corresponds to:
μ_x = cx, μ_y = cy, σ_x = w/2, σ_y = h/2
the probability density function of the 2D gaussian distribution is:
wherein X, μ, Σ represent coordinates (X, y), mean and variance, respectively. When:
this ellipse is one distribution profile of a 2D gaussian distribution. Thus, the horizontal bounding box r= (cx, cy, w, h) can be modeled as a 2D gaussian distribution:
in this way, the similarity between two bounding boxes can be expressed in terms of the distance between the two gaussian distributions.
Next, the invention uses the Wasserstein distance from optimal transport theory to compute the distance between the two distributions. For two 2D Gaussian distributions μ_1 = N(m_1, Σ_1) and μ_2 = N(m_2, Σ_2), their 2nd-order Wasserstein distance can be defined as:
W_2²(μ_1, μ_2) = ||m_1 − m_2||_2² + ||Σ_1^(1/2) − Σ_2^(1/2)||_F²
wherein ||·||_F denotes the Frobenius norm. For two bounding boxes A = (cx_a, cy_a, w_a, h_a) and B = (cx_b, cy_b, w_b, h_b) modeled as Gaussian distributions N_a and N_b, this reduces to:
W_2²(N_a, N_b) = ||[cx_a, cy_a, w_a/2, h_a/2]^T − [cx_b, cy_b, w_b/2, h_b/2]^T||_2²
However, this is a distance metric and cannot be used directly as a similarity. A normalized exponential is applied to obtain a new metric, the Normalized Wasserstein Distance:
NWD(N_a, N_b) = exp(−√(W_2²(N_a, N_b)) / C)
wherein C is a constant related to the dataset.
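A brief sketch (an assumption in PyTorch, not the patented code) of the NWD similarity between two horizontal boxes in (cx, cy, w, h) format follows: each box is modeled as a 2D Gaussian, the 2nd-order Wasserstein distance is computed in closed form, and the exponential normalizes it into a (0, 1] similarity; using 1 − NWD as the localization loss term is one possible choice:

import torch

def nwd(box_a, box_b, C):
    """Normalized Wasserstein Distance between boxes in (cx, cy, w, h) format.
    C is the dataset-related constant from the formula above (must be chosen per dataset)."""
    ga = torch.stack([box_a[..., 0], box_a[..., 1], box_a[..., 2] / 2, box_a[..., 3] / 2], dim=-1)
    gb = torch.stack([box_b[..., 0], box_b[..., 1], box_b[..., 2] / 2, box_b[..., 3] / 2], dim=-1)
    w2 = torch.norm(ga - gb, dim=-1)          # W_2, the square root of W_2^2
    return torch.exp(-w2 / C)                 # similarity in (0, 1]

def nwd_loss(pred_boxes, target_boxes, C=12.8):   # C = 12.8 is only a placeholder assumption
    return (1.0 - nwd(pred_boxes, target_boxes, C)).mean()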
The constructed model is then trained on the input dataset; after each epoch, the model parameters of the current epoch are stored and its classification accuracy is compared with that of the previous best model. When the set maximum number of epochs is reached, the target recognition model with the best recognition accuracy is output. The trained model can detect and identify objects in poorly lit environments, including people, animals, cars, other vehicles and obstacles.
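The epoch loop just described could look like the following sketch (an assumption; the evaluate_fn helper, loader format and file names are hypothetical): after each epoch the current weights are saved, and the checkpoint with the best accuracy so far is kept until the maximum epoch is reached:

import torch

def train(model, train_loader, val_loader, optimizer, criterion, evaluate_fn, max_epochs, device="cuda"):
    best_acc = 0.0
    for epoch in range(max_epochs):
        model.train()
        for (rgb, ir), targets in train_loader:             # paired visible/infrared batches
            optimizer.zero_grad()
            loss = criterion(model(rgb.to(device), ir.to(device)), targets.to(device))
            loss.backward()
            optimizer.step()
        acc = evaluate_fn(model, val_loader, device)        # user-supplied accuracy/mAP evaluation
        torch.save(model.state_dict(), f"epoch_{epoch}.pt") # store current-epoch parameters
        if acc > best_acc:                                  # compare with previous best model
            best_acc = acc
            torch.save(model.state_dict(), "best.pt")
    return best_acc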
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (6)

1. A multi-modal feature target detection method based on dynamic convolution and an attention mechanism, characterized by comprising the following steps:
S1: establishing a neural network model of a dual-stream convolution detection network based on YOLOv5, wherein the Backbone adopts convolution operations and feature fusion modules to perform modal fusion and feature learning;
S2: constructing a multispectral module MS-CBAM from channel attention and spatial attention, wherein channel attention performs feature weighting on the visible light and infrared feature maps separately, the infrared and visible light feature maps are then stacked together, spatial attention performs feature weighting on the stacked feature map, and a residual network then refines the features;
S3: introducing a multi-head attention mechanism into the convolution structure, and establishing the dynamic convolution module ODConv by assigning different attention coefficient matrices to the convolution along the four dimensions of input channel, output channel, space and convolution kernel;
S4: placing the MS-CBAM module at the position outputting the larger 80×80×256 feature map, and placing the ODConv modules at the positions outputting the medium and small 40×40×512 and 20×20×1024 feature maps; the three feature maps of different sizes enter the Neck layer, i.e. the feature pyramid, for further feature extraction, and predictions are made on the output features to produce the detection result;
S5: in the training stage, the visible light and infrared data enter the dual-stream neural network for training after Mosaic data augmentation, adaptive anchor box computation and adaptive image scaling; the network is initialized with YOLOv5l pre-trained weights, and its parameters are learned with stochastic gradient descent;
in the prediction stage, a softmax classifier is used to obtain the final classification probability of the category to which the target belongs;
in the optimization stage, the error between the ground truth and the prediction is reduced by jointly optimizing the localization loss, classification loss and confidence loss, and NWD is introduced into the localization loss to improve small target detection accuracy; step S5 is repeated until the number of iterations reaches the preset value, at which point model training is complete and the model can perform target detection tasks.
2. The multi-modal feature target detection method based on dynamic convolution and an attention mechanism according to claim 1, characterized in that: in step S1, the input of the YOLOv5-based dual-stream convolutional target detection network is an image pair of different modalities, the Backbone is a dual-stream convolutional network, and the dual-stream neural network model comprises the Backbone, the Neck and a prediction layer;
let the input visible light feature map be X_V and the input infrared feature map be X_T, where the height, width and channel number of the feature maps are H, W and C respectively;
the feature extraction network uses three feature fusion modules and a residual network to form a three-cycle feature extraction and refinement structure, and the ith feature fusion is computed as:
f^i = σ(F(X_V, X_T))
wherein σ is a feature fusion function, X_V is the visible light input feature map, X_T is the infrared input feature map, and F is the feature fusion module performing the batch normalization operation; the fused feature map f^i has height H, width W and channel number 2C; a residual network is then constructed by combining the fused feature with the original features:
f_v^i = X_V + f^i
f_t^i = X_T + f^i
obtaining the new visible light and infrared feature maps f_v^i and f_t^i.
3. The multi-modal feature target detection method based on dynamic convolution and an attention mechanism according to claim 2, characterized in that: in step S2, channel attention is computed separately on the visible light and infrared input feature maps, the two maps are then stacked along the channel dimension, and the stacked feature map is fed into the spatial attention for further processing;
the computation of the MS-CBAM module is expressed as:
X = M_S[concat[M_C(X_V), M_C(X_T)]]
wherein M_C denotes the channel attention mechanism, M_S denotes the spatial attention mechanism, and concat denotes stacking the feature maps along the channel dimension;
the residual network constructed from X then refines the features, expressed as:
X'_V = X_V + X
X'_T = X_T + X
and the finally obtained feature maps X'_V ∈ V^(B×C×H×W) and X'_T ∈ T^(B×C×H×W) represent the final output of the MS-CBAM module.
4. The multi-modal feature target detection method based on dynamic convolution and an attention mechanism according to claim 3, characterized in that: in step S3, a multi-head self-attention mechanism is introduced into the convolution process, and ODConv assigns different attention coefficient matrices to the convolution along the four dimensions of input channel, output channel, space and convolution kernel, improving the feature extraction capability; the overall operation of the ODConv module is expressed as:
X' = ODConv(concat(X_V, X_T))
wherein X_V and X_T are the input feature maps of the visible light and infrared modalities respectively, concat denotes stacking the two inputs along the channel dimension, and ODConv denotes the dynamic convolution operation;
the dynamic convolution formula integrating the four dimensions is:
y = (α_w1 ⊙ α_f1 ⊙ α_c1 ⊙ α_s1 ⊙ W_1 + ... + α_wn ⊙ α_fn ⊙ α_cn ⊙ α_sn ⊙ W_n) * x
wherein α_wi is the attention coefficient matrix of the convolution kernel W_i, and α_si, α_ci and α_fi are the dynamic convolution attention coefficient matrices along the spatial dimension, input channel dimension and output channel dimension of W_i respectively; ⊙ denotes element-wise multiplication along the corresponding dimension of the kernel space, and * denotes the convolution operation.
5. The multi-modal feature target detection method based on dynamic convolution and an attention mechanism according to claim 4, characterized in that: the inputs and outputs of the MS-CBAM module and the ODConv module are both visible light and infrared feature maps, and the output and the input form a residual network;
the total loss combining the localization loss, classification loss and confidence loss is expressed as:
L_total = L_box + L_cls + L_conf
wherein the localization loss adopts the NWD loss function, which introduces the Normalized Wasserstein Distance to compute similarity between the corresponding Gaussian distributions.
6. The multi-modal feature target detection method based on dynamic convolution and an attention mechanism according to claim 5, characterized in that: the NWD loss function is expressed as:
NWD(N_a, N_b) = exp(−√(W_2²(N_a, N_b)) / C)
wherein W_2²(N_a, N_b) is the Wasserstein distance between the Gaussian distributions N_a and N_b corresponding to the two bounding boxes, and C is a fixed constant related to the dataset, thereby improving the detection performance for small targets.
CN202310454888.5A 2023-04-25 2023-04-25 Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism Pending CN116452937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310454888.5A CN116452937A (en) 2023-04-25 2023-04-25 Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310454888.5A CN116452937A (en) 2023-04-25 2023-04-25 Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism

Publications (1)

Publication Number Publication Date
CN116452937A true CN116452937A (en) 2023-07-18

Family

ID=87125416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310454888.5A Pending CN116452937A (en) 2023-04-25 2023-04-25 Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism

Country Status (1)

Country Link
CN (1) CN116452937A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665176A (en) * 2023-07-21 2023-08-29 石家庄铁道大学 Multi-task network road target detection method for vehicle automatic driving
CN116665176B (en) * 2023-07-21 2023-09-26 石家庄铁道大学 Multi-task network road target detection method for vehicle automatic driving
CN116883825A (en) * 2023-07-26 2023-10-13 南京信息工程大学 Underwater target detection method combining multi-mode data fusion and multiplexing
CN116883825B (en) * 2023-07-26 2024-08-02 南京信息工程大学 Underwater target detection method combining multi-mode data fusion and Multiplemix
CN116977880A (en) * 2023-08-25 2023-10-31 内蒙古农业大学 Grassland rat hole detection method based on unmanned aerial vehicle image
CN117690161A (en) * 2023-12-12 2024-03-12 上海工程技术大学 Pedestrian detection method, device and medium based on image fusion
CN117690161B (en) * 2023-12-12 2024-06-04 上海工程技术大学 Pedestrian detection method, device and medium based on image fusion
CN117893475A (en) * 2023-12-15 2024-04-16 航天科工空天动力研究院(苏州)有限责任公司 High-precision PCB micro defect detection algorithm based on multidimensional attention mechanism
CN117935012A (en) * 2024-01-31 2024-04-26 广东海洋大学 Infrared and visible light image fusion network based on distributed structure
CN117893537A (en) * 2024-03-14 2024-04-16 深圳市普拉托科技有限公司 Decoloring detection method and system for tray surface material
CN117893537B (en) * 2024-03-14 2024-05-28 深圳市普拉托科技有限公司 Decoloring detection method and system for tray surface material
CN118521837A (en) * 2024-07-23 2024-08-20 诺比侃人工智能科技(成都)股份有限公司 Rapid iteration method of intelligent detection model for defects of contact net

Similar Documents

Publication Publication Date Title
CN116452937A (en) Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism
CN111767882B (en) Multi-mode pedestrian detection method based on improved YOLO model
CN110298262B (en) Object identification method and device
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN109902806B (en) Method for determining target bounding box of noise image based on convolutional neural network
CN108229468B (en) Vehicle appearance feature recognition and vehicle retrieval method and device, storage medium and electronic equipment
CN114220124A (en) Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system
CN111291809B (en) Processing device, method and storage medium
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
CN110222718B (en) Image processing method and device
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN113807464A (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLO V5
CN110097029B (en) Identity authentication method based on high way network multi-view gait recognition
CN113326735B (en) YOLOv 5-based multi-mode small target detection method
CN115512251A (en) Unmanned aerial vehicle low-illumination target tracking method based on double-branch progressive feature enhancement
WO2022217434A1 (en) Cognitive network, method for training cognitive network, and object recognition method and apparatus
CN113298032A (en) Unmanned aerial vehicle visual angle image vehicle target detection method based on deep learning
CN116912485A (en) Scene semantic segmentation method based on feature fusion of thermal image and visible light image
CN117372898A (en) Unmanned aerial vehicle aerial image target detection method based on improved yolov8
CN113205103A (en) Lightweight tattoo detection method
CN114170526A (en) Remote sensing image multi-scale target detection and identification method based on lightweight network
CN113870160A (en) Point cloud data processing method based on converter neural network
Barodi et al. An enhanced artificial intelligence-based approach applied to vehicular traffic signs detection and road safety enhancement
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN118196544A (en) Unmanned aerial vehicle small target detection method and system based on information enhancement and feature fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination