CN114694001A - Target detection method and device based on multi-modal image fusion - Google Patents

Target detection method and device based on multi-modal image fusion Download PDF

Info

Publication number
CN114694001A
CN114694001A (application CN202210137919.XA)
Authority
CN
China
Prior art keywords
vector
layer
image
module
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210137919.XA
Other languages
Chinese (zh)
Inventor
张树
马杰超
俞益洲
李一鸣
乔昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Original Assignee
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenrui Bolian Technology Co Ltd, Shenzhen Deepwise Bolian Technology Co Ltd filed Critical Beijing Shenrui Bolian Technology Co Ltd
Priority to CN202210137919.XA priority Critical patent/CN114694001A/en
Publication of CN114694001A publication Critical patent/CN114694001A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target detection method and device based on multi-modal image fusion. The method comprises the following steps: acquiring a video image and an infrared image in real time, and inputting them into a target detection model built from Transformers; extracting global features from the video image and the infrared image respectively; fusing the extracted video image features and infrared image features; and inputting the fused features of the video image and the infrared image into a prediction module composed of Transformer fully-connected layers, which outputs the target category and the target position. Because the target detection model is built from a pure Transformer, the model advantages brought by the overall Transformer structure can be fully exploited. The invention performs target detection on fused features of the video image and the infrared image, so targets can be detected under any illumination condition, solving the problem that existing detection systems perform poorly in dark environments such as night scenes.

Description

Target detection method and device based on multi-modal image fusion
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a target detection method and device based on multi-modal image fusion.
Background
How to help visually impaired and otherwise vulnerable people achieve better mobility has long been a social problem of wide concern. Timely and correct perception of the surrounding environment is an indispensable condition for improving the safety and quality of life of these individuals. With the rapid development of computer vision in recent years, deep learning models based on convolutional neural networks (CNNs) have shown outstanding ability in real-time recognition of natural-scene images, at times exceeding human accuracy and stability, and have been successfully deployed in products such as the autonomous driving systems that have recently achieved excellent results.
Wearable vision-assistance devices developed for visually impaired people also benefit from these advances: a miniature camera or sensor on the device collects image or video data of the real-time scene, and an on-board model performs the corresponding computation to provide the wearer with scene target detection results. However, most target detection models are modeled on visible-light color images with sufficient brightness. When such a model receives visible-light input captured under poor ambient lighting (for example at night or in a dark indoor space), its performance drops sharply and it cannot achieve adequate recognition, so the corresponding vision-assistance device cannot warn the wearer of danger in time.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a target detection method and apparatus based on multi-modal image fusion.
In order to achieve the above object, the present invention adopts the following technical solutions.
In a first aspect, the present invention provides a target detection method based on multi-modal image fusion, including the following steps:
acquiring, in real time, a video image and an infrared image captured respectively by a video camera and an infrared camera, and inputting the video image and the infrared image into a target detection model built from Transformers;
respectively extracting global features of the video image and the infrared image by using a feature encoding module composed of Transformer encoders;
fusing the extracted video image features and the infrared image features by using a feature fusion module composed of Transformer decoders;
and inputting the fused features of the video image and the infrared image into a prediction module composed of Transformer fully-connected layers, and outputting the target category and the target position.
Further, before global feature extraction, the method further comprises performing the following operations on the input video image and the input infrared image respectively:
cutting the image into N slices;
flattening each slice along the channel dimension and inputting it to a linear fully-connected layer to obtain a d-dimensional vector;
and calculating sine and cosine position codes along the slice row and column directions, and adding them to the output of the linear fully-connected layer to obtain an N×d encoding matrix.
Furthermore, the feature encoding module is formed by stacking Transformer encoders, where each Transformer encoder comprises a multi-head self-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The N×d encoding matrix of a video image or infrared image input to the multi-head self-attention module undergoes three different linear transformations to obtain a query vector, a key vector and a value vector of size N×d'; the similarity between the query vector and the key vector is computed as a scaled vector dot product and normalized by a softmax function to obtain an attention weight matrix, and multiplying the weight matrix by the value vector yields one attention result. The multi-head attention results are concatenated and mapped back to the original dimension d' to obtain the feature encoding of the video image or the infrared image.
Furthermore, the feature fusion module is formed by stacking Transformer decoders, where each Transformer decoder comprises a multi-head self-attention module layer, a multi-head mutual-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The query vector Q_i of the multi-head mutual-attention module layer of the i-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_i and the value vector V_i come respectively from a video image feature A and an infrared image feature B output by the feature encoding module; the query vector Q_{i+1} of the multi-head mutual-attention module layer of the (i+1)-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_{i+1} and the value vector V_{i+1} come from B and A, respectively. The key vector K_i and the value vector V_i are both N×d' matrices, and the query vector Q_i is an N'×d' matrix, N' < N; i = 1, 2, ….
Further, the method further comprises: judging dangerous targets and their directions according to the target category and the target position, and issuing danger warning information.
In a second aspect, the present invention provides an object detection apparatus based on multi-modal image fusion, including:
the image acquisition module is used for acquiring, in real time, a video image and an infrared image captured respectively by a video camera and an infrared camera, and inputting them into a target detection model built from Transformers;
the feature extraction module is used for respectively extracting global features of the video image and the infrared image by utilizing a feature encoding module composed of Transformer encoders;
the feature fusion module is used for fusing the extracted video image features and the infrared image features by utilizing a feature fusion module composed of Transformer decoders;
and the target prediction module is used for inputting the fused features of the video image and the infrared image into a prediction module composed of Transformer fully-connected layers and outputting the target category and the target position.
Further, the apparatus also includes a vector embedding module to:
cutting the image into N slices;
flattening each slice along the channel dimension and inputting it to a linear fully-connected layer to obtain a d-dimensional vector;
and calculating sine and cosine position codes along the slice row and column directions, and adding them to the output of the linear fully-connected layer to obtain an N×d encoding matrix.
Furthermore, the feature encoding module is formed by stacking Transformer encoders, where each Transformer encoder comprises a multi-head self-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The N×d encoding matrix of a video image or infrared image input to the multi-head self-attention module undergoes three different linear transformations to obtain a query vector, a key vector and a value vector of size N×d'; the similarity between the query vector and the key vector is computed as a scaled vector dot product and normalized by a softmax function to obtain an attention weight matrix, and multiplying the weight matrix by the value vector yields one attention result. The multi-head attention results are concatenated and mapped back to the original dimension d' to obtain the feature encoding of the video image or the infrared image.
Furthermore, the feature fusion module is formed by stacking Transformer decoders, where each Transformer decoder comprises a multi-head self-attention module layer, a multi-head mutual-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The query vector Q_i of the multi-head mutual-attention module layer of the i-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_i and the value vector V_i come respectively from a video image feature A and an infrared image feature B output by the feature encoding module; the query vector Q_{i+1} of the multi-head mutual-attention module layer of the (i+1)-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_{i+1} and the value vector V_{i+1} come from B and A, respectively. The key vector K_i and the value vector V_i are both N×d' matrices, and the query vector Q_i is an N'×d' matrix, N' < N; i = 1, 2, ….
Furthermore, the device also comprises a danger early-warning module, used for judging dangerous targets and their directions according to the target category and the target position, and issuing danger warning information.
Compared with the prior art, the invention has the following beneficial effects.
According to the method, a video image and an infrared image are acquired in real time, a target detection model composed of a pure Transformer extracts global features from the video image and the infrared image respectively, the extracted video image features and infrared image features are fused, and target categories are predicted from the fused features, realizing target detection based on multi-modal image fusion. Because the target detection model is built from a pure Transformer, the model advantages brought by the overall Transformer structure can be fully exploited. The invention performs target detection on fused features of the video image and the infrared image, so targets can be detected under any illumination condition, solving the problem that existing detection systems perform poorly in dark environments such as night scenes.
Drawings
Fig. 1 is a flowchart of a target detection method based on multi-modal image fusion according to an embodiment of the present invention.
Fig. 2 is a schematic view of an overall structure of a target detection model according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of the self-attention mechanism.
Fig. 4 is a schematic diagram of the cascading of two Transformer decoders.
Fig. 5 is a block diagram of an object detection apparatus based on multi-modal image fusion according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and the detailed description. It should be understood that the described embodiments are merely some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a target detection method for multi-modal image fusion according to an embodiment of the present invention, including the following steps:
step 101, acquiring, in real time, a video image and an infrared image captured respectively by a video camera and an infrared camera, and inputting them into a target detection model built from Transformers;
step 102, respectively extracting global features from the video image and the infrared image by using a feature encoding module composed of Transformer encoders;
step 103, fusing the extracted video image features and the infrared image features by using a feature fusion module composed of Transformer decoders;
and step 104, inputting the fused features of the video image and the infrared image into a prediction module composed of Transformer fully-connected layers, and outputting the target category and the target position.
In this embodiment, step 101 is mainly used for acquiring a video image and an infrared image in real time. Most existing target detection models for assisting visually impaired people are modeled on visible-light color images with sufficient brightness, so their performance drops greatly when receiving visible-light input captured under poor ambient lighting (for example at night or in a dark space), and they cannot achieve the required recognition capability. For this reason, the present embodiment acquires an infrared image at the same time as the video image. Because infrared imaging is not affected by illumination conditions, the collected infrared image powerfully supplements scene target information in dark environments, so a target detection model based on fusion of the video image and the infrared image retains a high level of generalization in both bright and dark scenes. The target detection model of this embodiment adopts a pure Transformer structure, fully exploiting the model advantages brought by the overall Transformer architecture, and can achieve better accuracy and generalization than a convolutional neural network (CNN) on image recognition tasks. The overall structure of the target detection model is shown in Fig. 2.
In this embodiment, step 102 is mainly used for image feature extraction. Feature extraction is performed on the video image and the infrared image respectively by a feature encoding module composed of Transformer encoders. The Transformer encoder adopts an attention mechanism and is mainly composed of multi-head self-attention modules; it extracts global features of the input image, which greatly improves target detection accuracy compared with a CNN, which can only extract local image features.
In this embodiment, step 103 is mainly used for multi-modal feature fusion. This embodiment fuses the extracted video image features and infrared image features with a feature fusion module composed of Transformer decoders. Existing CNN-based network models mainly use three fusion schemes for the multi-modal image fusion task, called early, middle and late fusion. In early fusion, images from multiple modalities are directly concatenated along the channel dimension at the model input and fed to the whole network. In middle fusion, each modality has its own feature extractor, and feature maps of each modality at a certain stage are fused using some defined fusion computation. In late fusion, the final results of all modalities, each passed through an independent feature extractor, are fused for prediction. All of these fusion methods choose the fusion point largely by trial and error, without sufficient theoretical justification or task directivity; they also assume that features of the different modalities correspond one-to-one in spatial position, while convolution performs only local fusion computation. In practice, images of different modalities, and even their feature maps, have some positional offset, so purely local computation may fail to align the corresponding features, leading to inefficient fusion and poor detection. In this embodiment, the Transformer provides an attention-based multi-modal fusion method (the Transformer decoder contains a multi-head self-attention module and a multi-head mutual-attention module) in place of the CNN, so that information from different modalities can attend to each other over a global scope; the fusion is therefore not limited by positional offsets, is more effective, and has stronger theoretical support.
In this embodiment, step 104 is mainly used to predict the target category. The fused features of the video image and the infrared image are input into a prediction module composed of Transformer fully-connected layers, which predicts the target category. The targets in this embodiment are dangerous objects that may threaten movement, and target categories are graded by danger level; for example, a tunnel or utility pole directly ahead is high risk, while a bicycle parked to the side is medium risk. The prediction module generally outputs the target position along with the target category. It consists of two fully-connected branches: one branch, made up of N1 fully-connected layers, predicts the target category, and the other branch, made up of N2 fully-connected layers, regresses the target position (the coordinates of the upper-left and lower-right corners of the detection box), thereby completing the target detection task. Both branches take the same input: the fused features finally output by the feature fusion module.
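As an illustration only, the following PyTorch-style sketch shows how such a two-branch prediction module could be organized; the class name PredictionModule, the layer counts N1/N2, the hidden sizes, and the normalization of box corners to [0, 1] via a sigmoid are hypothetical choices for the example, not values stated in this disclosure.

```python
import torch
import torch.nn as nn


def mlp(in_dim: int, hidden: int, out_dim: int, num_layers: int) -> nn.Sequential:
    """Stack of fully-connected layers with ReLU between them."""
    dims = [in_dim] + [hidden] * (num_layers - 1) + [out_dim]
    layers = []
    for i in range(num_layers):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < num_layers - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)


class PredictionModule(nn.Module):
    def __init__(self, d_prime: int, num_classes: int, n1: int = 3, n2: int = 3):
        super().__init__()
        # One branch of N1 fully-connected layers for category, one of N2 layers for box.
        self.cls_branch = mlp(d_prime, d_prime, num_classes, n1)
        self.box_branch = mlp(d_prime, d_prime, 4, n2)   # (x1, y1, x2, y2)

    def forward(self, fused: torch.Tensor):
        # Both branches receive the same fused features (B, N', d') from the fusion module.
        cls_logits = self.cls_branch(fused)              # target category scores per query
        boxes = self.box_branch(fused).sigmoid()         # corner coordinates, normalized (example choice)
        return cls_logits, boxes
```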
As an alternative embodiment, before performing global feature extraction, the method further includes the following operations performed on the input video image and the input infrared image respectively:
cutting the image into N slices;
flattening each slice along the channel dimension and inputting it to a linear fully-connected layer to obtain a d-dimensional vector;
and calculating sine and cosine position codes along the slice row and column directions, and adding them to the output of the linear fully-connected layer to obtain an N×d encoding matrix.
This embodiment provides a technical scheme for vector embedding of the input video image and infrared image. The input video image and infrared image must be converted by embedded encoding into the sequence-type input a Transformer accepts. Specifically, an image of size C × H × W is cut into patches; assuming each patch has spatial size h × w, this yields N = (H/h) × (W/w) patches of size C × h × w. Each patch is flattened along the channel dimension C into a vector of length C × h × w, and the resulting N × (C × h × w) matrix is fed to a linear fully-connected layer that maps the dimension to d. In addition, so that the patch encoding carries two-dimensional position information rather than being permutation-invariant, fixed d-dimensional sine and cosine position codes are computed for the row and column directions respectively and added to the output of the linear layer, finally giving an N × d matrix, i.e. the linear embedded encoding of the input image, where the d-dimensional vector in each row is the representative vector of one patch, and the number of rows N can be called the number of representative vectors. Note that N depends on the chosen patch size and can be set flexibly according to the actual requirements of a specific task.
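For illustration, the following PyTorch-style sketch shows one possible reading of this embedding step: patch slicing, a linear fully-connected projection to dimension d, and fixed sine/cosine codes over the row and column directions. The names PatchEmbedding and row_col_position_encoding, and the particular way the row and column codes split the d dimensions, are assumptions made for the example rather than details given in this disclosure.

```python
import torch
import torch.nn as nn


def row_col_position_encoding(rows: int, cols: int, d: int) -> torch.Tensor:
    """Fixed sinusoidal codes for the row and column directions, each taking half of d."""
    assert d % 4 == 0  # keeps the example simple: d is split evenly between row and column codes

    def sincos(n: int, dim: int) -> torch.Tensor:
        pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)       # (n, 1)
        i = torch.arange(0, dim, 2, dtype=torch.float32)              # (dim/2,)
        angle = pos / torch.pow(10000.0, i / dim)
        pe = torch.zeros(n, dim)
        pe[:, 0::2] = torch.sin(angle)
        pe[:, 1::2] = torch.cos(angle)
        return pe

    half = d // 2
    row_pe = sincos(rows, half)                                       # (rows, d/2)
    col_pe = sincos(cols, half)                                       # (cols, d/2)
    # Patch (r, c) receives [row code ; column code], so the code carries 2-D position.
    pe = torch.cat([row_pe.unsqueeze(1).expand(rows, cols, half),
                    col_pe.unsqueeze(0).expand(rows, cols, half)], dim=-1)
    return pe.reshape(rows * cols, d)                                 # (N, d)


class PatchEmbedding(nn.Module):
    def __init__(self, channels: int, patch_h: int, patch_w: int, d: int):
        super().__init__()
        self.patch_h, self.patch_w = patch_h, patch_w
        self.proj = nn.Linear(channels * patch_h * patch_w, d)        # linear fully-connected layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:               # x: (B, C, H, W)
        b, c, H, W = x.shape
        rows, cols = H // self.patch_h, W // self.patch_w
        # Cut into N = rows*cols patches and flatten each along the channel dimension.
        x = x.unfold(2, self.patch_h, self.patch_h).unfold(3, self.patch_w, self.patch_w)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, rows * cols, -1)   # (B, N, C*h*w)
        tokens = self.proj(x)                                         # (B, N, d)
        pe = row_col_position_encoding(rows, cols, tokens.shape[-1]).to(tokens.device)
        return tokens + pe                                            # N x d encoding per image
```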
As an optional embodiment, the feature encoding module is formed by stacking Transformer encoders, where each Transformer encoder comprises a multi-head self-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The N×d encoding matrix of a video image or infrared image input to the multi-head self-attention module undergoes three different linear transformations to obtain a query vector, a key vector and a value vector of size N×d'; the similarity between the query vector and the key vector is computed as a scaled vector dot product and normalized by a softmax function to obtain an attention weight matrix, and multiplying the weight matrix by the value vector yields one attention result. The multi-head attention results are concatenated and mapped back to the original dimension d' to obtain the feature encoding of the video image or the infrared image.
This embodiment provides a specific technical solution for feature extraction. Feature extraction is realized by the feature encoding module, which is obtained by stacking Transformer encoders; the exact number of stacked layers can be tuned for the specific task, and the encoders of the two image branches are mutually independent and may or may not use the same number of layers. Each Transformer encoder consists (in order) of a multi-head self-attention module layer and a feed-forward module layer, with residual connections and normalization applied at each layer. The computation of the self-attention mechanism is illustrated in Fig. 3: the input N × d encoding matrix is transformed by three linear mapping functions W_q, W_k, W_v to obtain a Query vector, a Key vector and a Value vector of size N × d'; the similarity between the query vector and the key vector is computed as a scaled vector dot product and normalized by a softmax function to obtain the attention weight matrix, expressed by the formula
α = softmax(Q · K^T / √d')
where α is the weight matrix, Q is the query vector, and K^T is the transpose of the key vector. The weight matrix is then multiplied by the value vector (equivalently, the value vectors are weighted column-wise by the attention weights and summed to obtain each entry of the result matrix). Multi-head self-attention repeats this process several times independently, concatenates the results, and maps them back to the original feature dimension d'. The feed-forward module layer is a multi-layer perceptron (MLP) with one hidden layer. Through the Transformer encoder, the input image encodes features over a self-modeled global scope, i.e. the encoding of each representative vector includes its computed similarity to all other representative vectors, a global property that CNN feature extraction does not have.
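A minimal PyTorch-style sketch of the multi-head self-attention computation described above (and in Fig. 3) is given below; the head count and the split of d' across heads are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d: int, d_prime: int, num_heads: int):
        super().__init__()
        assert d_prime % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_prime // num_heads
        # Three different linear transformations giving query, key and value of size N x d'.
        self.w_q = nn.Linear(d, d_prime)
        self.w_k = nn.Linear(d, d_prime)
        self.w_v = nn.Linear(d, d_prime)
        self.out = nn.Linear(d_prime, d_prime)   # map the concatenated heads back to d'

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, N, d)
        b, n, _ = x.shape

        def split(t: torch.Tensor) -> torch.Tensor:           # (B, N, d') -> (B, heads, N, head_dim)
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        # Scaled dot-product similarity, softmax-normalized into attention weights alpha.
        alpha = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        heads = alpha @ v                                      # weighted sum of the value vectors
        heads = heads.transpose(1, 2).reshape(b, n, -1)        # concatenate the heads
        return self.out(heads)                                 # back to dimension d'
```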
As an optional embodiment, the feature fusion module is formed by stacking Transformer decoders, where each Transformer decoder comprises a multi-head self-attention module layer, a multi-head mutual-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The query vector Q_i of the multi-head mutual-attention module layer of the i-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_i and the value vector V_i come respectively from a video image feature A and an infrared image feature B output by the feature encoding module; the query vector Q_{i+1} of the multi-head mutual-attention module layer of the (i+1)-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_{i+1} and the value vector V_{i+1} come from B and A, respectively. The key vector K_i and the value vector V_i are both N×d' matrices, and the query vector Q_i is an N'×d' matrix, N' < N; i = 1, 2, ….
This embodiment provides a specific technical solution for feature fusion. Fusion of the image features of the two modalities is realized by the feature fusion module, which is stacked from Transformer decoders; Fig. 4 shows the structure of two Transformer decoders stacked in sequence. As with the encoders, the number of stacked decoder layers can be tuned for the specific task. Each Transformer decoder consists (in order) of a multi-head self-attention module layer, a multi-head mutual-attention module layer and a feed-forward module layer, with residual connections and normalization applied at each layer. The multi-head self-attention module layer and the feed-forward module layer are the same as in the Transformer encoder. The multi-head mutual-attention module layer uses the same computation as self-attention; the only difference is that its query vector comes from the output of the multi-head self-attention module layer, while its key vector and value vector come respectively from the video image feature A and the infrared image feature B output by the feature encoding module. Notably, the order in which the key and value vectors of adjacent decoders connect to the image features A and B is exactly opposite: for example, if the key and value vectors of the current decoder connect to A and B, then those of the previous and next decoders connect to B and A respectively, so that the query vector alternately attends to and fuses the features of the two modalities. This design effectively balances information deviations that may exist between the two modalities, including positional offsets, extracting effective content with similar distributions and modeling key global interrelations. Note also that a specially defined query vector must be separately initialized as the input to the first Transformer decoder layer. This query vector is a set of learnable parameters that implicitly learns how to extract the position encoding of regions where targets exist in the multi-modal images; it acts as an intermediary in the fusion, has good task directivity and prior, and is a key component for completing both the target detection task and the multi-modal fusion task. The dimension of the query vector is the same as that of the modal image encoding, but its size N' (i.e. its number of rows in the encoding matrix) is much smaller than the number N of modal image encodings, N' << N, while being slightly larger than the maximum number of targets expected in a data image, which reduces missed detections; during attention computation only the necessary features interact, reducing information redundancy and greatly reducing computational cost.
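The alternating connection of the key and value vectors to features A and B can be sketched as follows, assuming PyTorch and nn.MultiheadAttention; the class names, layer composition and hyper-parameters are an approximation of the described decoder under these assumptions, not the exact implementation of this disclosure.

```python
import torch
import torch.nn as nn


class FusionDecoderLayer(nn.Module):
    def __init__(self, d_prime: int, num_heads: int, ff_dim: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_prime, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_prime, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_prime, ff_dim), nn.ReLU(), nn.Linear(ff_dim, d_prime))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_prime) for _ in range(3))

    def forward(self, q: torch.Tensor, key_src: torch.Tensor, val_src: torch.Tensor) -> torch.Tensor:
        q = self.norm1(q + self.self_attn(q, q, q)[0])               # multi-head self-attention
        q = self.norm2(q + self.cross_attn(q, key_src, val_src)[0])  # multi-head mutual attention
        return self.norm3(q + self.ff(q))                            # feed-forward + residual


class FeatureFusion(nn.Module):
    def __init__(self, num_layers: int, num_queries: int, d_prime: int, num_heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(FusionDecoderLayer(d_prime, num_heads, 4 * d_prime)
                                    for _ in range(num_layers))
        # Learnable query vectors of size N' x d', with N' << N, slightly above the expected target count.
        self.queries = nn.Parameter(torch.randn(num_queries, d_prime))

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(feat_a.shape[0], -1, -1)   # (B, N', d')
        for i, layer in enumerate(self.layers):
            # Adjacent decoders connect key/value to (A, B) and (B, A) in opposite order.
            key_src, val_src = (feat_a, feat_b) if i % 2 == 0 else (feat_b, feat_a)
            q = layer(q, key_src, val_src)
        return q                                                        # fused features, (B, N', d')
```

The swap of key_src and val_src between consecutive layers is what realizes the alternation described above, so the learnable queries gather information from both modalities as they pass through the stack.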
As an alternative embodiment, the method further comprises: judging dangerous targets and their directions according to the output target category and target position, and issuing danger warning information.
This embodiment provides a technical scheme for danger warning. Danger warning is a post-processing step: dangerous targets are identified from the target category and target position output by the prediction module, the direction (and distance) of each target relative to the user is computed, and finally warning information is delivered to the user through a voice module to draw attention or prompt avoidance.
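As a toy illustration of this post-processing step (the disclosure does not specify the computation in this detail), the sketch below derives a coarse direction and proximity from each detection box; the thresholds, the proximity heuristic and the risky_categories set are hypothetical.

```python
from typing import List, Tuple

Detection = Tuple[str, float, float, float, float]   # (category, x1, y1, x2, y2) in pixels


def danger_warnings(detections: List[Detection],
                    image_width: int,
                    image_height: int,
                    risky_categories: set) -> List[str]:
    """Map each risky detection to a coarse direction and proximity message."""
    messages = []
    for category, x1, y1, x2, y2 in detections:
        if category not in risky_categories:
            continue
        center_x = (x1 + x2) / 2.0
        if center_x < image_width / 3:
            direction = "to the left"
        elif center_x > 2 * image_width / 3:
            direction = "to the right"
        else:
            direction = "straight ahead"
        # Box height relative to the image as a very rough distance proxy:
        # taller boxes usually mean the object is closer to the wearer.
        proximity = "close" if (y2 - y1) > 0.4 * image_height else "farther away"
        messages.append(f"Warning: {category} {direction}, {proximity}")
    return messages


# Example: a utility pole detected in the middle of a 640x640 frame yields
# ["Warning: utility pole straight ahead, close"]
print(danger_warnings([("utility pole", 300, 100, 380, 600)], 640, 640, {"utility pole"}))
```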
Fig. 5 is a schematic composition diagram of an object detection apparatus for multi-modal image fusion according to an embodiment of the present invention, where the apparatus includes:
the image acquisition module 11, used for acquiring, in real time, a video image and an infrared image captured respectively by a video camera and an infrared camera, and inputting them into a target detection model built from Transformers;
the feature extraction module 12, configured to perform global feature extraction on the video image and the infrared image respectively by using a feature encoding module composed of Transformer encoders;
the feature fusion module 13, configured to fuse the extracted video image features and infrared image features by using a feature fusion module composed of Transformer decoders;
and the target prediction module 14, used for inputting the fused features of the video image and the infrared image into a prediction module composed of Transformer fully-connected layers and outputting the target category and the target position.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again. The same applies to the following embodiments, which are not further described.
As an alternative embodiment, the apparatus further comprises a vector embedding module configured to:
cutting the image into N slices;
flattening each slice along the channel dimension and inputting it to a linear fully-connected layer to obtain a d-dimensional vector;
and calculating sine and cosine position codes along the slice row and column directions, and adding them to the output of the linear fully-connected layer to obtain an N×d encoding matrix.
As an optional embodiment, the feature encoding module is formed by stacking Transformer encoders, where each Transformer encoder comprises a multi-head self-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The N×d encoding matrix of a video image or infrared image input to the multi-head self-attention module undergoes three different linear transformations to obtain a query vector, a key vector and a value vector of size N×d'; the similarity between the query vector and the key vector is computed as a scaled vector dot product and normalized by a softmax function to obtain an attention weight matrix, and multiplying the weight matrix by the value vector yields one attention result. The multi-head attention results are concatenated and mapped back to the original dimension d' to obtain the feature encoding of the video image or the infrared image.
As an optional embodiment, the feature fusion module is formed by stacking Transformer decoders, where each Transformer decoder comprises a multi-head self-attention module layer, a multi-head mutual-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The query vector Q_i of the multi-head mutual-attention module layer of the i-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_i and the value vector V_i come respectively from a video image feature A and an infrared image feature B output by the feature encoding module; the query vector Q_{i+1} of the multi-head mutual-attention module layer of the (i+1)-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_{i+1} and the value vector V_{i+1} come from B and A, respectively. The key vector K_i and the value vector V_i are both N×d' matrices, and the query vector Q_i is an N'×d' matrix, N' < N; i = 1, 2, ….
As an optional embodiment, the device further includes a danger early-warning module, configured to determine dangerous targets and their directions according to the target category and the target position, and issue danger warning information.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A target detection method based on multi-modal image fusion is characterized by comprising the following steps:
acquiring, in real time, a video image and an infrared image captured respectively by a video camera and an infrared camera, and inputting the video image and the infrared image into a target detection model built from Transformers;
respectively extracting global features of the video image and the infrared image by using a feature encoding module composed of Transformer encoders;
fusing the extracted video image features and the infrared image features by using a feature fusion module composed of Transformer decoders;
and inputting the fused features of the video image and the infrared image into a prediction module composed of Transformer fully-connected layers, and outputting the target category and the target position.
2. The method for target detection based on multi-modal image fusion as claimed in claim 1, further comprising the following operations respectively performed on the input video image and the infrared image before performing the global feature extraction:
cutting the image into N slices;
flattening each slice along the channel dimension and inputting it to a linear fully-connected layer to obtain a d-dimensional vector;
and calculating sine and cosine position codes along the slice row and column directions, and adding them to the output of the linear fully-connected layer to obtain an N×d encoding matrix.
3. The method of claim 2, wherein the feature encoding module is formed by stacking Transformer encoders, each Transformer encoder comprising a multi-head self-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer; the N×d encoding matrix of a video image or infrared image input to the multi-head self-attention module undergoes three different linear transformations to obtain a query vector, a key vector and a value vector of size N×d'; the similarity between the query vector and the key vector is computed as a scaled vector dot product and normalized by a softmax function to obtain an attention weight matrix, and multiplying the weight matrix by the value vector yields one attention result; and the multi-head attention results are concatenated and mapped back to the original dimension d' to obtain the feature encoding of the video image or the infrared image.
4. The method of claim 3, wherein the feature fusion module is formed by stacking Transformer decoders, each Transformer decoder comprising a multi-head self-attention module layer, a multi-head mutual-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer; the query vector Q_i of the multi-head mutual-attention module layer of the i-th Transformer decoder comes from the output of the multi-head self-attention module layer, and the key vector K_i and the value vector V_i come respectively from a video image feature A and an infrared image feature B output by the feature encoding module; the query vector Q_{i+1} of the multi-head mutual-attention module layer of the (i+1)-th Transformer decoder comes from the output of the multi-head self-attention module layer, and the key vector K_{i+1} and the value vector V_{i+1} come from B and A, respectively; the key vector K_i and the value vector V_i are both N×d' matrices, and the query vector Q_i is an N'×d' matrix, N' < N; i = 1, 2, ….
5. The method for target detection based on multi-modal image fusion as claimed in claim 1, further comprising: judging dangerous targets and their directions according to the target category and the target position, and issuing danger warning information.
6. A target detection device based on multi-modal image fusion, comprising:
the image acquisition module, used for acquiring, in real time, a video image and an infrared image captured respectively by a video camera and an infrared camera, and inputting the video image and the infrared image into a target detection model built from Transformers;
the feature extraction module, used for respectively extracting global features of the video image and the infrared image by utilizing a feature encoding module composed of Transformer encoders;
the feature fusion module, used for fusing the extracted video image features and the infrared image features by utilizing a feature fusion module composed of Transformer decoders;
and the target prediction module, used for inputting the fused features of the video image and the infrared image into a prediction module composed of Transformer fully-connected layers and outputting the target category and the target position.
7. The multi-modal image fusion based object detection apparatus of claim 6, further comprising a vector embedding module configured to:
cutting the image into N slices;
flattening each slice along the channel dimension and inputting it to a linear fully-connected layer to obtain a d-dimensional vector;
and calculating sine and cosine position codes along the slice row and column directions, and adding them to the output of the linear fully-connected layer to obtain an N×d encoding matrix.
8. The apparatus of claim 7, wherein the feature encoding module is formed by stacking Transformer encoders, each Transformer encoder comprising a multi-head self-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer; the N×d encoding matrix of a video image or infrared image input to the multi-head self-attention module undergoes three different linear transformations to obtain a query vector, a key vector and a value vector of size N×d'; the similarity between the query vector and the key vector is computed as a scaled vector dot product and normalized by a softmax function to obtain an attention weight matrix, and multiplying the weight matrix by the value vector yields one attention result; and the multi-head attention results are concatenated and mapped back to the original dimension d' to obtain the feature encoding of the video image or the infrared image.
9. The apparatus of claim 8, wherein the feature fusion module is formed by stacking Transformer decoders, each Transformer decoder comprising a multi-head self-attention module layer, a multi-head mutual-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer; the query vector Q_i of the multi-head mutual-attention module layer of the i-th Transformer decoder comes from the output of the multi-head self-attention module layer, and the key vector K_i and the value vector V_i come respectively from a video image feature A and an infrared image feature B output by the feature encoding module; the query vector Q_{i+1} of the multi-head mutual-attention module layer of the (i+1)-th Transformer decoder comes from the output of the multi-head self-attention module layer, and the key vector K_{i+1} and the value vector V_{i+1} come from B and A, respectively; the key vector K_i and the value vector V_i are both N×d' matrices, and the query vector Q_i is an N'×d' matrix, N' < N; i = 1, 2, ….
10. The target detection device based on multi-modal image fusion as claimed in claim 6, further comprising a danger early-warning module for judging dangerous targets and their directions according to the target category and the target position and issuing danger warning information.
CN202210137919.XA 2022-02-15 2022-02-15 Target detection method and device based on multi-modal image fusion Pending CN114694001A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210137919.XA CN114694001A (en) 2022-02-15 2022-02-15 Target detection method and device based on multi-modal image fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210137919.XA CN114694001A (en) 2022-02-15 2022-02-15 Target detection method and device based on multi-modal image fusion

Publications (1)

Publication Number Publication Date
CN114694001A true CN114694001A (en) 2022-07-01

Family

ID=82137295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210137919.XA Pending CN114694001A (en) 2022-02-15 2022-02-15 Target detection method and device based on multi-modal image fusion

Country Status (1)

Country Link
CN (1) CN114694001A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596614A (en) * 2022-03-03 2022-06-07 清华大学 Anti-photo attack face recognition system and method
CN115240042A (en) * 2022-07-05 2022-10-25 抖音视界有限公司 Multi-modal image recognition method and device, readable medium and electronic equipment
CN115205179A (en) * 2022-07-15 2022-10-18 小米汽车科技有限公司 Image fusion method and device, vehicle and storage medium
CN116246213A (en) * 2023-05-08 2023-06-09 腾讯科技(深圳)有限公司 Data processing method, device, equipment and medium
CN116246213B (en) * 2023-05-08 2023-07-28 腾讯科技(深圳)有限公司 Data processing method, device, equipment and medium
CN116401961A (en) * 2023-06-06 2023-07-07 广东电网有限责任公司梅州供电局 Method, device, equipment and storage medium for determining pollution grade of insulator
CN116401961B (en) * 2023-06-06 2023-09-08 广东电网有限责任公司梅州供电局 Method, device, equipment and storage medium for determining pollution grade of insulator
CN116740662A (en) * 2023-08-15 2023-09-12 贵州中南锦天科技有限责任公司 Axle recognition method and system based on laser radar
CN116740662B (en) * 2023-08-15 2023-11-21 贵州中南锦天科技有限责任公司 Axle recognition method and system based on laser radar
CN117726991A (en) * 2024-02-07 2024-03-19 金钱猫科技股份有限公司 High-altitude hanging basket safety belt detection method and terminal
CN117726991B (en) * 2024-02-07 2024-05-24 金钱猫科技股份有限公司 High-altitude hanging basket safety belt detection method and terminal

Similar Documents

Publication Publication Date Title
CN114694001A (en) Target detection method and device based on multi-modal image fusion
CN112801027B (en) Vehicle target detection method based on event camera
CN111523378B (en) Human behavior prediction method based on deep learning
CN112200057A (en) Face living body detection method and device, electronic equipment and storage medium
CN112288776B (en) Target tracking method based on multi-time step pyramid codec
CN116758130A (en) Monocular depth prediction method based on multipath feature extraction and multi-scale feature fusion
CN114266938A (en) Scene recognition method based on multi-mode information and global attention mechanism
CN116468714A (en) Insulator defect detection method, system and computer readable storage medium
CN116258757A (en) Monocular image depth estimation method based on multi-scale cross attention
CN115393404A (en) Double-light image registration method, device and equipment and storage medium
CN115908896A (en) Image identification system based on impulse neural network with self-attention mechanism
CN116092178A (en) Gesture recognition and tracking method and system for mobile terminal
CN114170304B (en) Camera positioning method based on multi-head self-attention and replacement attention
CN111898671B (en) Target identification method and system based on fusion of laser imager and color camera codes
Jing et al. SmokeSeger: A Transformer-CNN coupled model for urban scene smoke segmentation
CN116402811B (en) Fighting behavior identification method and electronic equipment
CN116258756B (en) Self-supervision monocular depth estimation method and system
CN117115855A (en) Human body posture estimation method and system based on multi-scale transducer learning rich visual features
Xiong et al. MLP-Pose: Human pose estimation by MLP-mixer
CN114399628B (en) Insulator high-efficiency detection system under complex space environment
CN115331301A (en) 6D attitude estimation method based on Transformer
CN116311493A (en) Two-stage human-object interaction detection method based on coding and decoding architecture
CN115100680A (en) Pedestrian detection method based on multi-source image fusion
CN114648755A (en) Text detection method for industrial container in light-weight moving state
CN114596614A (en) Anti-photo attack face recognition system and method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination