CN112418163A - Multispectral target detection blind guiding system - Google Patents
Multispectral target detection blind guiding system
- Publication number
- CN112418163A (application CN202011426982.2A)
- Authority
- CN
- China
- Prior art keywords
- visible light
- infrared thermal
- thermal image
- candidate frame
- feature map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V20/00—Scenes; Scene-specific elements
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/22—Matching criteria, e.g. proximity measures
- G06F18/24—Classification techniques
- G06V10/40—Extraction of image or video features
- G06V2201/07—Indexing scheme relating to image or video recognition or understanding: Target detection
Abstract
The invention provides a multispectral target detection blind guiding system, comprising: a data input module for acquiring a visible light image and an infrared thermal image; a deformable feature extractor module for extracting a visible light image feature map and an infrared thermal image feature map respectively; a candidate frame extraction network for extracting visible light image candidate frames and infrared thermal image candidate frames; a candidate frame complementation module for adding the portions of the visible light image candidate frames not covered by the infrared thermal image candidate frames to the infrared thermal image candidate frames, and the portions of the infrared thermal image candidate frames not covered by the visible light image candidate frames to the visible light image candidate frames, obtaining a visible light image region feature map and an infrared thermal image region feature map; a cross-modal attention fusion module for fusing the visible light image region feature map into the infrared thermal image region feature map according to the similarity relations among region features, obtaining enhanced thermal image features; and a classification and regression module for obtaining the target detection result.
Description
Technical Field
The invention relates to the field of computers, in particular to a multispectral target detection blind guiding system.
Background
The tremendous development of computer vision in recent years has brought new opportunities and possibilities for blind guiding systems. Deep learning models based on convolutional neural networks (CNNs) have reached or even surpassed human-level performance on image classification (ImageNet dataset) and object detection (COCO dataset). Visual perception systems (especially object detection systems) based on deep learning have also performed well in applications such as autonomous driving. Using this technology to help the blind perceive their environment has therefore become a new trend. However, conventional target detection models are generally built on visible light color images, so their applicable scenes are limited by lighting conditions: they cannot be used at night or in places with excessively strong light. Likewise, blind guiding systems based on this technology cannot assist the blind in perceiving the environment around the clock.
Disclosure of Invention
The present invention aims to provide a multispectral target detection blind guiding system that overcomes, or at least partially solves, the above-mentioned problems.
To achieve this purpose, the technical solution of the invention is realized as follows:
one aspect of the present invention provides a multispectral target detection blind guiding system, comprising: a data input module for acquiring a visible light image and an infrared thermal image; a deformable feature extractor module for extracting image features of the visible light image and the infrared thermal image respectively using deformable convolution and outputting a visible light image feature map and an infrared thermal image feature map; a candidate frame extraction network for extracting candidate frames of target objects from the visible light image feature map and the infrared thermal image feature map to obtain visible light image candidate frames and infrared thermal image candidate frames; a candidate frame complementation module for adding the portions of the visible light image candidate frames not covered by the infrared thermal image candidate frames to the infrared thermal image candidate frames, and the portions of the infrared thermal image candidate frames not covered by the visible light image candidate frames to the visible light image candidate frames, to obtain a visible light image region feature map and an infrared thermal image region feature map; a cross-modal attention fusion module for taking the infrared thermal image region feature map as the query vectors and the visible light image region feature map as the key and value vectors and, following the self-attention mechanism, fusing the visible light image region feature map into the infrared thermal image region feature map according to the similarity relations among region features, to obtain thermal image features enhanced by the color image features; and a classification and regression module for performing convolution calculations on the thermal image features enhanced by the color image features and the visible light image region feature map to obtain a target detection result, where the target detection result includes the category of each region and the candidate box offsets.
The data input module is also used for determining the category and position of the training targets. The system further comprises a loss calculation module for computing, with a loss function, the model's combined prediction error on the two tasks of box regression and box classification from the target detection result and the training targets, back-propagating the gradient of the error, and updating the model parameters for model training; iterating continuously, the prediction error of the model keeps decreasing until convergence, yielding a deployable model.
Wherein the deformable feature extractor module comprises: a first deformable feature extractor for extracting image features of the visible light image to obtain the visible light image feature map; and a second deformable feature extractor for extracting image features of the infrared thermal image to obtain the infrared thermal image feature map; the visible light image feature map and the infrared thermal image feature map have the same size.
Wherein the deformable convolution formula is y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k, where y(p) = Σ_{k=1}^{K} w_k · x(p + p_k) is the conventional convolution operation formula, x represents the input feature map, y represents the output feature map, p is the pixel position (w_0, h_0) currently being computed, k denotes the position index within the convolution range, p_k is the position offset relative to p, w_k is the weight corresponding to position k, Δp_k is the additional learned position offset of point k in the convolution, and Δm_k is the additional learned weight of point k in the convolution.
Wherein the first and second deformable feature extractors learn w_k, Δp_k and Δm_k independently of each other; or the first and second deformable feature extractors learn w_k independently while sharing the learning of Δp_k and Δm_k.
Wherein the candidate box extraction network comprises: a first candidate frame extraction network, connected to the first deformable feature extractor, for extracting visible light image candidate frames containing objects from the visible light image feature map; and a second candidate frame extraction network, connected to the second deformable feature extractor, for extracting infrared thermal image candidate frames containing objects from the infrared thermal image feature map.
The candidate frame complementation module is specifically used for adding the portions of the visible light image candidate frames not covered by the infrared thermal image candidate frames to the infrared thermal image candidate frames, adding the portions of the infrared thermal image candidate frames not covered by the visible light image candidate frames to the visible light image candidate frames, extracting region features of different sizes at the corresponding positions of the initial feature maps according to the selected candidate frames, and unifying the region features to the same size through a region pooling layer, so as to obtain a visible light image region feature map and an infrared thermal image region feature map of the same size.
The cross-modal attention fusion module is specifically used for reducing the dimensionality of the infrared thermal image region feature map and the visible light image region feature map through independent convolutions, computing the pairwise similarity relations between the region features of the infrared thermal image region feature map and those of the visible light image region feature map to obtain a relation matrix, weight-normalizing the similarities, and multiplying the convolved features of the visible light image region feature map by the relation matrix to output bimodal complementary enhanced region features, i.e. the thermal image features enhanced by the color image features.
Wherein the cross-modal attention fusion module is further used for adding or concatenating the thermal image features enhanced by the color image features with the infrared thermal image region feature map; the classification and regression module is further configured to perform convolution calculations on the feature map obtained by adding or concatenating the enhanced thermal image features with the infrared thermal image region feature map, together with the visible light image region feature map, to obtain the target detection result, where the target detection result includes the category of each region and the candidate box offsets.
Therefore, the multispectral target detection blind guiding system provided by the invention combines visible light color images and infrared thermal images to build an all-weather, end-to-end multimodal/multispectral target detection blind guiding system, solving the problem that existing blind guiding systems are unsupported or perform poorly in scenes with no illumination, low illumination, or excessively strong illumination.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a multispectral target detection blind guiding system according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a multispectral target detection blind guiding system according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a cross-attention fusion module according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The core of the invention is that:
existing multispectral/multimodal object detection systems commonly assume that the color and thermal images of the same scene are perfectly aligned, but in practice this is not the case: the images of the two modalities tend to be offset in position. This false assumption causes detection systems to make errors or even fail. Moreover, fusion of multimodal data is currently carried out pixel by pixel, which reduces robustness to misalignment and limits the effectiveness of complementary feature fusion. The present invention aims to propose a new solution to these problems.
In the invention, the possible position offset between images of different modalities is taken into account in the network design: on the one hand, the network implicitly learns the alignment relation between the two modalities, avoiding errors that may occur in conventional systems; on the other hand, a region-of-interest (ROI) level feature fusion module is introduced to further improve robustness to the misalignment problem. In addition, the scheme requires no additional annotations, saving cost.
It should be noted that the fusion module is the core of a multispectral target detection system, because it determines how the system uses the information of multimodal images to improve prediction performance. In conventional systems, regardless of where the fusion module is placed, the fusion itself is very naive, such as element-wise addition, concatenation (concat), or weighting of corresponding positions; these approaches do not make effective use of the complementary information of the different modalities, which limits the generalization ability of the model in complex real scenes. For the feature fusion part, the invention provides a candidate frame complementation module and a cross-modal attention module, so that the relevant information of the two modalities is used more fully, the correlation between the two modalities is modeled more comprehensively, and the information exchange between the features of the two modalities within the network is promoted purposefully; this intercommunication improves the accuracy and generalization ability of the system.
Fig. 1 shows a schematic structural diagram of a multispectral target detection blind guiding system provided by an embodiment of the present invention, and referring to fig. 1, the multispectral target detection blind guiding system provided by the embodiment of the present invention includes:
the data input module is used for acquiring a visible light image and an infrared thermal image;
the deformable feature extractor module is used for respectively extracting image features of the visible light image and the infrared thermal image by adopting deformable convolution and outputting a visible light image feature map and an infrared thermal image feature map;
the candidate frame extraction network is used for extracting candidate frames of the target object according to the visible light image characteristic diagram and the infrared thermal image characteristic diagram to obtain a visible light image candidate frame and an infrared thermal image candidate frame;
the candidate frame complementation module is used for adding the part, which is not covered by the infrared thermal image candidate frame, in the visible light image candidate frame into the infrared thermal image candidate frame and adding the part, which is not covered by the visible light image candidate frame, in the infrared thermal image candidate frame into the visible light image candidate frame to obtain a visible light image area characteristic map and an infrared thermal image area characteristic map;
the cross-modal attention fusion module is used for taking the infrared thermal image region characteristic graph as a query vector, taking the visible light image region characteristic graph as a key vector and a value vector, and fusing the visible light image region characteristic graph into the infrared thermal image region characteristic graph according to the similarity relation among the region characteristics by referring to the self-attention module to obtain thermal image characteristics enhanced by color image characteristics;
the classification and regression module is used for performing convolution calculation on the thermal image characteristics enhanced by the color image characteristics and the visible light image area characteristic diagram to obtain a target detection result, wherein the target detection result comprises: the category of each region and the candidate box offset.
Therefore, the invention provides an all-weather, end-to-end multimodal/multispectral target detection blind guiding system combining visible light color images and infrared thermal images. The invention discloses an end-to-end target detection system that requires no position offset supervision and aggregates the information of the two modalities at the region feature level through a self-attention module. The model is based on a two-stage detection algorithm, namely Faster R-CNN: the feature extraction and region proposal network (RPN) stage has two independent branches that extract region features from the visible light and infrared images respectively; the region features of the two branches are then fused through the candidate frame complementation module and the cross-modal self-attention module, and finally the region categories and coordinates are predicted. In a specific embodiment, any general two-stage detection model such as FPN or R-FCN can be used as the base model; it is not limited to Faster R-CNN.
The following describes in detail the multispectral target detection blind guiding system provided by the embodiment of the present invention with reference to fig. 2 and fig. 3:
as an optional implementation manner of the embodiment of the present invention, the data input module is further configured to determine a category and a location of the training target. Therefore, the learning target can be input to the detection network in the training process to carry out model training.
Specifically, the method comprises the following steps:
a data input module: the input to the neural network comprises two parts, image input and detection target input. The image input is the bimodal pair of images to be detected, namely a visible light color image (RGB, three channels) and an infrared thermal image (a grayscale image; the raw data has only one channel). Assuming the input image has height and width H and W, the images fed to the network have sizes [N×3×H×W, N×1×H×W] (in some common implementations, the channel of the infrared thermal image may also be replicated three times to obtain a three-channel input, i.e. N×3×H×W), where N is the batch size. The detection target input is the category and position of each annotated target object in the original image, with the position represented by the coordinates [x1, y1, x2, y2] of the target object's circumscribed rectangle, where [x1, y1] and [x2, y2] are the coordinates of the top-left and bottom-right corners of the bounding box respectively. The detection targets are first annotated manually and are fed to the detection network as learning targets during training for model training.
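To make the shapes concrete, the following is a minimal sketch of the data-input format described above, assuming a PyTorch implementation; the tensor names and sample values are illustrative only and not part of the patent.

```python
import torch

N, H, W = 4, 512, 640                      # batch size and image resolution (example values)
rgb = torch.rand(N, 3, H, W)               # visible-light colour images, N x 3 x H x W
thermal = torch.rand(N, 1, H, W)           # infrared thermal images, N x 1 x H x W
# Optionally replicate the thermal channel three times to obtain a 3-channel input.
thermal_3ch = thermal.repeat(1, 3, 1, 1)   # N x 3 x H x W

# Detection targets: per image, class labels and boxes [x1, y1, x2, y2]
# (top-left and bottom-right corners of the ground-truth rectangle).
targets = [
    {"labels": torch.tensor([1, 3]),
     "boxes": torch.tensor([[40.0, 60.0, 120.0, 200.0],
                            [300.0, 80.0, 420.0, 260.0]])}
    for _ in range(N)
]
```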
As an optional implementation of the embodiment of the invention, the deformable feature extractor module comprises: the first deformable feature extractor is used for extracting image features of the visible light image to obtain a visible light image feature map; the second deformable feature extractor is used for extracting image features of the infrared thermal image to obtain an infrared thermal image feature map; the visible light image characteristic diagram and the infrared thermal image characteristic diagram have the same size.
As an optional implementation of the embodiment of the present invention, the deformable convolution formula is y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k, where y(p) = Σ_{k=1}^{K} w_k · x(p + p_k) is the general convolution operation formula, x represents the input feature map, y represents the output feature map, p is the pixel position (w_0, h_0) currently being computed, k denotes the position index within the convolution range, p_k is the position offset relative to p, w_k is the weight corresponding to position k, Δp_k is the additional learned position offset of point k in the convolution, and Δm_k is the additional learned weight of point k in the convolution.
As an optional implementation of the embodiments of the invention, the first and second deformable feature extractors learn w_k, Δp_k and Δm_k independently of each other; or the first and second deformable feature extractors learn w_k independently while sharing the learning of Δp_k and Δm_k.
Specifically, the method comprises the following steps:
a deformable feature extractor module: because the image inputs of the two modalities have a position offset, the invention uses deformable convolution in the feature extraction module so that the two branch networks implicitly achieve feature-level alignment while extracting image features independently. In current mainstream feature extractors such as ResNet-50, the convolutions use kernels with relatively fixed geometry, e.g. square 3×3 or 7×7 kernels, whose capacity to model geometric transformations is inherently limited. Deformable convolution adds a displacement variable to the conventional convolution; the displacement is learned automatically during model training, so the receptive field of the convolution after the displacement is no longer a square but becomes an arbitrary polygon adapted to the training data. Facing the position offset present in the multimodal images, the two feature extractors can automatically adjust their respective deformable convolutions through training, so that the extracted features are aligned at the level of the convolution receptive field. Meanwhile, no additional supervision information is needed to calibrate the images of the two modalities, which saves cost and is easy to implement. Taking the two modality images from the data input module as input, the feature maps finally output by the two feature extractors have size N×C×H′×W′, where H′ = H/16, W′ = W/16, and C is the feature dimension or number of channels, e.g. 512 or 2048.
The conventional convolution operation is given by formula (1), y(p) = Σ_{k=1}^{K} w_k · x(p + p_k), where x represents the input feature map, y the output feature map, and p the pixel position (w_0, h_0) currently being computed; k is the position index within the convolution range (for example, K = 9 in a 3×3 convolution); p_k is the position offset relative to p; and w_k, the weight for position k, is a learnable parameter. The deformable convolution, formula (2), y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k, adds two learnable parameters to formula (1): Δp_k, an additional position offset for point k in the convolution, and Δm_k, an additional weight for point k in the convolution. As a supplementary explanation, the invention provides two implementation forms of the deformable convolution network as embodiments. In the first, the feature extraction networks for the visible light images and the thermal infrared images are completely independent of each other, and the deformation parameters of the deformable convolution are learned by the respective feature extraction networks, i.e. the learnable parameters w_k, Δp_k, Δm_k of the deformable convolutions in the two branch networks are all independent. In the second, the feature extraction networks for the visible image and the thermal infrared image are only partially independent: the main computations of feature extraction are independent, i.e. the w_k are independent, but the deformation parameters Δp_k and Δm_k are shared; specifically, these latter two deformation parameters are learned by taking the fused features of the two modalities as input.
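As an illustration only, the sketch below shows how the modulated deformable convolution of formula (2) could be realized with torchvision's deform_conv2d operator (assumed available); the small offset/mask prediction convolutions, the `guide` argument used for the shared-deformation-parameter variant, and all layer sizes are assumptions of the sketch, not the patent's exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)  # w_k
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, 3, padding=1)         # predicts Δp_k
        self.mask_pred = nn.Conv2d(in_ch, k * k, 3, padding=1)               # predicts Δm_k
        self.k = k

    def forward(self, x, guide=None):
        # 'guide' (same channel count as x) lets the offsets be learned from a fused
        # bimodal feature, as in the second, shared-parameter variant; otherwise each
        # branch predicts its own offsets from its own feature map.
        src = x if guide is None else guide
        offset = self.offset_pred(src)              # additional position offsets Δp_k
        mask = torch.sigmoid(self.mask_pred(src))   # additional per-point weights Δm_k
        return deform_conv2d(x, offset, self.weight, padding=self.k // 2, mask=mask)

# Each branch applies its own deformable block to its own feature map.
rgb_feat = DeformableBlock(64, 128)(torch.rand(1, 64, 32, 40))   # visible-light branch
ir_feat = DeformableBlock(64, 128)(torch.rand(1, 64, 32, 40))    # thermal branch
```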
As an optional implementation manner of the embodiment of the present invention, the candidate box extraction network includes: the first candidate frame extraction network is used for connecting the first deformable feature extractor and extracting visible light image candidate frames with objects in the visible light image feature map; and the second candidate frame extraction network is connected with the second deformable feature extractor and is used for extracting the infrared thermal image candidate frames of the object in the infrared thermal image feature map.
Specifically, the method comprises the following steps:
candidate box extraction network: the candidate frame extraction network module takes the feature maps output by the feature extractors as input and extracts candidate frames that contain an object, i.e. predictions of the true circumscribed rectangles of the target objects, regardless of which specific class each object belongs to. Specifically, k anchor frames (anchors) of different sizes are generated for each pixel of the feature map; the feature map within the k anchor frames is then fed to the extraction network, which predicts the probability that each anchor frame contains an object, i.e. k×2 classification results, and the offset of the anchor frame relative to the real position of the object, i.e. k×4 regression results. For a feature map of size N×C×H′×W′, the extraction network outputs N×H′×W′×k×2 + N×H′×W′×k×4 results. Finally, through non-maximum suppression filtering, the M (usually M = 1024) candidate frames most likely to contain an object are selected and stored in a matrix of size M×4. The two branch networks each output M relatively independent candidate frames based on their respective feature maps.
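A minimal sketch of such a candidate-frame (RPN-style) head is given below, assuming PyTorch; the layer widths and anchor count are illustrative, and the non-maximum suppression step is only noted in a comment.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_ch, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch, 3, padding=1)
        self.cls = nn.Conv2d(in_ch, k * 2, 1)   # object / background score per anchor (k x 2)
        self.reg = nn.Conv2d(in_ch, k * 4, 1)   # offsets w.r.t. each anchor (k x 4)

    def forward(self, feat):
        t = torch.relu(self.conv(feat))
        return self.cls(t), self.reg(t)

feat = torch.rand(1, 512, 32, 40)                # N x C x H' x W'
scores, deltas = RPNHead(512)(feat)              # 1 x 18 x 32 x 40 and 1 x 36 x 32 x 40
# After decoding the anchors and applying non-maximum suppression, the top
# M (e.g. M = 1024) boxes per branch are kept as an M x 4 matrix of [x1, y1, x2, y2].
```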
As an optional implementation manner of the embodiment of the present invention, the candidate frame complementation module is specifically configured to add a portion, which is not covered by the infrared thermal image candidate frame, of the visible light image candidate frame to the infrared thermal image candidate frame, add a portion, which is not covered by the visible light image candidate frame, of the infrared thermal image candidate frame to the visible light image candidate frame, extract, according to the selected candidate frame, region features with different sizes at corresponding positions on the initial feature map, and unify, through the region pooling layer, the region features to the same size, so as to obtain the visible light image region feature map and the infrared thermal image region feature map with the same size.
Specifically, the method comprises the following steps:
candidate frame complementation module: this module fuses the candidate frames extracted by the two branch networks to achieve complementation between the two modalities at the level of target object positions. Under poor lighting conditions, the candidate frames extracted by the color image branch are likely to miss objects, while those extracted by the thermal image branch are relatively stable; conversely, there are cases where the thermal image branch misses objects that the color image can detect, such as low-temperature poles on a cloudy day. The module uses both modalities to obtain a more complete set of candidate frames. Specifically, the module takes the candidate boxes of the two modalities obtained in the previous stage as input; the candidate boxes of one modality whose IoU with the boxes of the other modality is smaller than a threshold p ∈ [0.5, 0.8] (say m of them) are added to the other modality's candidate boxes, increasing the number of candidate boxes of each modality to M′ = M + m; the exact value of the threshold p may vary slightly between embodiments. According to the selected candidate frames, region features of different sizes are extracted at the corresponding positions of the initial feature maps and then unified to the same size L×L (e.g. L = 7) through a region pooling layer (ROI pooling). Finally, the module outputs M′ region feature maps for each of the two modalities, with a total size of 2×N×M′×C×L×L.
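The following hedged sketch illustrates the complementation rule and the subsequent region pooling using torchvision's box_iou and roi_align (both assumed available); the box values, the threshold, and the feature-map scale are made-up examples rather than the patent's configuration.

```python
import torch
from torchvision.ops import box_iou, roi_align

def complement(boxes_a, boxes_b, p=0.5):
    """Append to boxes_b those boxes of boxes_a not already covered by boxes_b."""
    iou = box_iou(boxes_a, boxes_b)                  # |A| x |B| IoU matrix
    uncovered = iou.max(dim=1).values < p            # boxes of A whose best IoU with B is below p
    return torch.cat([boxes_b, boxes_a[uncovered]], dim=0)

rgb_boxes = torch.tensor([[10., 10., 60., 80.], [200., 40., 260., 120.]])
ir_boxes = torch.tensor([[12., 12., 58., 78.]])
ir_boxes_aug = complement(rgb_boxes, ir_boxes)       # thermal set gains the box it missed
rgb_boxes_aug = complement(ir_boxes, rgb_boxes)      # and vice versa

# Extract L x L region features from the thermal feature map (L = 7 here);
# spatial_scale maps image coordinates onto the 1/16-resolution feature map.
ir_feat_map = torch.rand(1, 512, 32, 40)
ir_regions = roi_align(ir_feat_map, [ir_boxes_aug], output_size=7, spatial_scale=1 / 16)
```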
As an optional implementation manner of the embodiment of the present invention, the cross-modal attention fusion module is specifically configured to perform dimension reduction on the infrared thermal image area feature map and the visible light image area feature map through independent convolution, calculate two similarity relationships between each area feature in the infrared thermal image area feature map and the visible light image area feature map, obtain a relationship matrix, perform weight normalization on the similarity, perform convolution on the features in the visible light image area feature map, and multiply the features by the relationship matrix to output a bimodal complementary enhanced area feature, thereby obtaining a thermal image feature enhanced by a color image feature.
Specifically, the method comprises the following steps:
a cross-modal attention fusion module: to improve the effect of feature fusion, the invention introduces a bidirectional feature enhancement module. Because different modalities image the environment through different mechanisms, point-to-point feature fusion can hardly provide substantial help. For example, a pedestrian under dark lighting conditions is mostly dark and invisible in the color image (the entire upper body), with only a small part illuminated (below the lower legs), so the color image branch cannot extract discriminative features at those points, and fusing them point by point into the thermal image features according to position coordinates cannot provide much complementary or enhancing benefit. Fusion at the region level is a comparatively more effective approach. Specifically, with the help of the candidate frame complementation module, the color image branch can also obtain the candidate frame of the mostly invisible pedestrian in the example above; the corresponding region feature contains the information of the small illuminated part of the lower legs, and a feature covering only part of an object can be fused into the thermal image feature as useful supplementary information to enhance the latter's representation ability. Therefore, the module adopts a strategy of fusion at the region feature level.
In addition, different objects in a scene often have a certain dependency relationship, and the relationship among the objects can help to improve the representation capability of the model, so that the prediction accuracy is improved. For example, when one person appears on a zebra crossing, there are often others (because of the green light); the bench or trash can in the park is typically placed outside the lawn, rather than on the lawn. Therefore, the module also models the relationship between the regions/potential objects in the process of fusing the characteristics of the two modal regions, and uses the relationship between the objects to enhance the expression of the region characteristics.
Specifically, the module takes the infrared thermal image region features and the visible light color image region features as input, uses the infrared thermal image region features as the query vectors and the visible light color image region features as the key and value vectors, and, following the self-attention module, fuses the visible light color image region features into the infrared thermal image region features according to the similarity relations between region features, thereby achieving enhancement of the infrared features by the visible light features, or equivalently complementation of the dual-spectrum features, as shown in fig. 2. For convenience of description, take N = 1. The thermal image region features and the color image region features first pass through independent L×L convolutions for the same dimensionality reduction, yielding queries and keys of size M′×C×1×1. The pairwise similarities between the region features of the two modalities are then computed to obtain a relation matrix of size M′×M′; the similarity can be computed as the negative Euclidean distance, a matrix (dot) product, or another measure, and the similarities are then weight-normalized, by default with a row-wise softmax in this embodiment. Meanwhile, the color image region features are transformed by a 1×1 convolution into values of size M′×C×L×L; finally, matrix multiplication with the relation matrix outputs bimodal complementary enhanced region features of size M′×C×L×L, i.e. the thermal image features enhanced by the color image features at the region feature level.
This module models the relations between candidate regions rather than the relations between all pixel points as in the original method, so its computational complexity is much lower (the former is O(M′×M′) ~ 10^6, the latter O(H×W×H×W) ~ 10^8) and the computation is efficient.
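A hedged sketch of this region-level cross-modal attention is given below, assuming PyTorch; the reduced dimension d, the residual addition at the end, and the module name are assumptions of the sketch rather than the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    def __init__(self, C, L=7, d=256):
        super().__init__()
        self.q = nn.Conv2d(C, d, L)       # L x L conv: thermal regions -> M' x d queries
        self.k = nn.Conv2d(C, d, L)       # L x L conv: colour regions -> M' x d keys
        self.v = nn.Conv2d(C, C, 1)       # 1 x 1 conv: colour regions -> M' x C x L x L values

    def forward(self, ir_regions, rgb_regions):
        M, C, L, _ = ir_regions.shape
        q = self.q(ir_regions).flatten(1)             # M' x d
        k = self.k(rgb_regions).flatten(1)            # M' x d
        rel = F.softmax(q @ k.t(), dim=1)             # M' x M' relation matrix, row-wise softmax
        v = self.v(rgb_regions).reshape(M, -1)        # M' x (C*L*L)
        fused = (rel @ v).reshape(M, C, L, L)         # colour-enhanced thermal region features
        return fused + ir_regions                     # residual add (concat is the alternative below)

ir_r = torch.rand(256, 512, 7, 7)                     # M' thermal region features
rgb_r = torch.rand(256, 512, 7, 7)                    # M' colour region features
enhanced = CrossModalAttention(512)(ir_r, rgb_r)
```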
As an optional implementation manner of the embodiment of the present invention, the cross-modal attention fusion module further adds or concatenates the thermal image features enhanced by the color image features with the infrared thermal image region feature map; the classification and regression module is further configured to perform convolution calculations on the feature map obtained by this addition or concatenation, together with the visible light image region feature map, to obtain the target detection result, where the target detection result includes the category of each region and the candidate box offsets. That is, the output of the cross-modal attention fusion module may be added to or concatenated (concat) with the thermal image region features.
As an optional implementation manner of the embodiment of the present invention, the system further includes: and the loss calculation module is used for calculating the comprehensive prediction error of the model in two tasks of frame regression and frame classification by adopting a loss function according to the target detection result and the training target, returning the gradient of the error, updating the model parameters, performing model training, and continuously iterating, wherein the prediction error of the model is continuously reduced until convergence, so that the model capable of being deployed is obtained.
Specifically, the method comprises the following steps:
a loss calculation module: this module takes the prediction results of the whole model and the corresponding training targets as input, computes with a conventional loss function the model's combined prediction error on the two tasks of box regression and box classification, then back-propagates the gradient of the error and updates the parameters of the whole model according to a given learning rate, realizing model training. The iteration is repeated and the prediction error of the model keeps decreasing until convergence, finally yielding a model that can be applied and deployed. During training, mixed precision training can be used for the network parameters to reduce GPU memory usage and accelerate training.
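The following is a minimal sketch of one training iteration with mixed precision via torch.cuda.amp (a CUDA device is assumed); the placeholder linear model and MSE loss merely stand in for the full detector and its box-regression/classification losses.

```python
import torch

model = torch.nn.Linear(10, 5).cuda()                 # placeholder for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()                  # scales the loss for mixed precision

def train_step(inputs, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        preds = model(inputs)
        # In the real system: classification loss + box-regression loss from the heads.
        loss = torch.nn.functional.mse_loss(preds, targets)
    scaler.scale(loss).backward()                     # back-propagate the error gradient
    scaler.step(optimizer)                            # update the model parameters
    scaler.update()
    return loss.item()

loss = train_step(torch.rand(8, 10, device="cuda"), torch.rand(8, 5, device="cuda"))
```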
Therefore, the multispectral target detection blind guiding system provided by the embodiment of the invention uses multispectral images to build an all-weather blind guiding system based on a target detection network; it applies deformable convolution in the feature extractors to implicitly learn the alignment relation between the multispectral images and cope with possible position offsets, and it provides a candidate frame complementation module and a cross-modal attention module that make full use of the complementary information of the multispectral images, thereby achieving more accurate all-weather target detection and further enhancing robustness to the feature misalignment problem. The system therefore supports all-weather blind guiding; the alignment relation of the multispectral images is learned implicitly, without additional annotation, which saves cost; the candidate frame complementation module and the cross-modal attention module exploit the complementary information of the multispectral images more fully, making the fusion more effective and the results better; and the cross-modal attention module has low computational complexity and high efficiency.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (9)
1. A multi-spectral target detection blind guide system, comprising:
the data input module is used for acquiring a visible light image and an infrared thermal image;
the deformable feature extractor module is used for respectively extracting the image features of the visible light image and the infrared thermal image by adopting deformable convolution and outputting a visible light image feature map and an infrared thermal image feature map;
the candidate frame extraction network is used for extracting candidate frames of the target object according to the visible light image characteristic diagram and the infrared thermal image characteristic diagram to obtain a visible light image candidate frame and an infrared thermal image candidate frame;
a candidate frame complementing module, configured to add, to the infrared thermal image candidate frame, a portion of the visible light image candidate frame that is not covered by the infrared thermal image candidate frame, and add, to the visible light image candidate frame, a portion of the infrared thermal image candidate frame that is not covered by the visible light image candidate frame, so as to obtain a visible light image area feature map and an infrared thermal image area feature map;
the cross-mode attention fusion module is used for taking the infrared thermal image region feature map as a query vector, taking the visible light image region feature map as a key vector and a value vector, and fusing the visible light image region feature map into the infrared thermal image region feature map according to the similarity relation among the region features by referring to the self-attention module to obtain thermal image features enhanced by color image features;
a classification and regression module, configured to perform convolution calculation on the thermal image features enhanced by the color image features and the visible light image region feature map to obtain a target detection result, where the target detection result includes: the category of each region and the candidate box offset.
2. The system of claim 1,
the data input module is also used for determining the category and the position of a training target;
the system further comprises:
and a loss calculation module, configured to compute, with a loss function, the model's combined prediction error on the two tasks of box regression and box classification from the target detection result and the training targets, back-propagate the gradient of the error, and update the model parameters to train the model, iterating continuously so that the prediction error of the model keeps decreasing until convergence, to obtain a model that can be applied and deployed.
3. The system of claim 1, wherein the deformable feature extractor module comprises:
the first deformable feature extractor is used for extracting image features of the visible light image to obtain a visible light image feature map;
the second deformable feature extractor is used for extracting image features of the infrared thermal image to obtain an infrared thermal image feature map;
the visible light image feature map and the infrared thermal image feature map are the same in size.
4. The system of claim 3, wherein the deformable convolution formula is:
y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k, wherein y(p) = Σ_{k=1}^{K} w_k · x(p + p_k) is the conventional convolution operation formula, x represents the input feature map, y represents the output feature map, p is the pixel position (w_0, h_0) currently being computed, k denotes the position index within the convolution range, p_k is the position offset relative to p, w_k is the weight corresponding to position k, Δp_k is the additional learned position offset of point k in the convolution, and Δm_k is the additional learned weight of point k in the convolution.
5. The system of claim 4, wherein the first deformable feature extractor and the second deformable feature extractor each independently learn w_k, Δp_k and Δm_k; or the first and second deformable feature extractors learn w_k independently while sharing the learning of Δp_k and Δm_k.
6. The system of claim 3, wherein the candidate box extraction network comprises:
a first candidate frame extraction network, connected to the first deformable feature extractor, for extracting the visible light image candidate frame in which an object exists in the visible light image feature map;
and the second candidate frame extraction network is connected with the second deformable feature extractor and is used for extracting the infrared thermal image candidate frames of the objects in the infrared thermal image feature map.
7. The system according to claim 1, wherein the candidate frame complementing module is specifically configured to add a portion of the visible-light image candidate frame that is not covered by the infrared thermal image candidate frame to the infrared thermal image candidate frame, add a portion of the infrared thermal image candidate frame that is not covered by the visible-light image candidate frame to the visible-light image candidate frame, extract, according to the selected candidate frame, the region features at different sizes of the corresponding positions on the initial feature map, and unify the region features to the same size through a region pooling layer, thereby obtaining the visible-light image region feature map and the infrared thermal image region feature map that have the same size.
8. The system of claim 1,
the cross-modal attention fusion module is specifically configured to reduce the dimensionality of the infrared thermal image region feature map and the visible light image region feature map through independent convolutions, compute the pairwise similarity relations between the region features of the infrared thermal image region feature map and those of the visible light image region feature map to obtain a relation matrix, weight-normalize the similarities, and convolve the features of the visible light image region feature map and multiply them by the relation matrix to output bimodal complementary enhanced region features, obtaining the thermal image features enhanced by the color image features.
9. The system of claim 1,
the cross-modal attention fusion module is used for adding or merging the thermal image characteristics enhanced by the color image characteristics and the infrared thermal image area characteristic map;
the classification and regression module is further configured to perform convolution calculation on the feature map obtained by adding or merging the thermal image features enhanced by the color image features with the infrared thermal image region feature map, together with the visible light image region feature map, to obtain a target detection result, where the target detection result includes: the category of each region and the candidate box offset.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011426982.2A CN112418163B (en) | 2020-12-09 | 2020-12-09 | Multispectral target detection blind guiding system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011426982.2A CN112418163B (en) | 2020-12-09 | 2020-12-09 | Multispectral target detection blind guiding system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112418163A true CN112418163A (en) | 2021-02-26 |
CN112418163B CN112418163B (en) | 2022-07-12 |
Family
ID=74775286
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011426982.2A Active CN112418163B (en) | 2020-12-09 | 2020-12-09 | Multispectral target detection blind guiding system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112418163B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113591770A (en) * | 2021-08-10 | 2021-11-02 | 中国科学院深圳先进技术研究院 | Multimode fusion obstacle detection method and device based on artificial intelligence blind guiding |
CN113688806A (en) * | 2021-10-26 | 2021-11-23 | 南京智谱科技有限公司 | Infrared and visible light image fused multispectral target detection method and system |
CN114359776A (en) * | 2021-11-25 | 2022-04-15 | 国网安徽省电力有限公司检修分公司 | Flame detection method and device integrating light imaging and thermal imaging |
CN115115919A (en) * | 2022-06-24 | 2022-09-27 | 国网智能电网研究院有限公司 | Power grid equipment thermal defect identification method and device |
CN115393684A (en) * | 2022-10-27 | 2022-11-25 | 松立控股集团股份有限公司 | Anti-interference target detection method based on automatic driving scene multi-mode fusion |
CN116468928A (en) * | 2022-12-29 | 2023-07-21 | 长春理工大学 | Thermal infrared small target detection method based on visual perception correlator |
CN116543378A (en) * | 2023-07-05 | 2023-08-04 | 杭州海康威视数字技术股份有限公司 | Image recognition method and device, electronic equipment and storage medium |
CN117078920A (en) * | 2023-10-16 | 2023-11-17 | 昆明理工大学 | Infrared-visible light target detection method based on deformable attention mechanism |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109448035A (en) * | 2018-11-14 | 2019-03-08 | 重庆邮电大学 | Infrared image and visible light image registration method based on deep learning |
CN111210443A (en) * | 2020-01-03 | 2020-05-29 | 吉林大学 | Deformable convolution mixing task cascading semantic segmentation method based on embedding balance |
CN111709902A (en) * | 2020-05-21 | 2020-09-25 | 江南大学 | Infrared and visible light image fusion method based on self-attention mechanism |
CN111861880A (en) * | 2020-06-05 | 2020-10-30 | 昆明理工大学 | Image super-fusion method based on regional information enhancement and block self-attention |
- 2020-12-09 CN CN202011426982.2A patent/CN112418163B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109448035A (en) * | 2018-11-14 | 2019-03-08 | 重庆邮电大学 | Infrared image and visible light image registration method based on deep learning |
CN111210443A (en) * | 2020-01-03 | 2020-05-29 | 吉林大学 | Deformable convolution mixing task cascading semantic segmentation method based on embedding balance |
CN111709902A (en) * | 2020-05-21 | 2020-09-25 | 江南大学 | Infrared and visible light image fusion method based on self-attention mechanism |
CN111861880A (en) * | 2020-06-05 | 2020-10-30 | 昆明理工大学 | Image super-fusion method based on regional information enhancement and block self-attention |
Non-Patent Citations (3)
Title |
---|
KAILAI ZHOU 等: "Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems", 《ARXIV.ORG》 * |
XIZHOU ZHU 等: "Deformable ConvNets V2: More Deformable, Better Results", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 * |
官大衍: "可见光与长波红外图像融合的行人检测方法研究", 《中国优秀博硕士学位论文全文数据库(博士)信息科技辑》 * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113591770A (en) * | 2021-08-10 | 2021-11-02 | 中国科学院深圳先进技术研究院 | Multimode fusion obstacle detection method and device based on artificial intelligence blind guiding |
CN113591770B (en) * | 2021-08-10 | 2023-07-18 | 中国科学院深圳先进技术研究院 | Multi-mode fusion obstacle detection method and device based on artificial intelligence blind guiding |
WO2023015799A1 (en) * | 2021-08-10 | 2023-02-16 | 中国科学院深圳先进技术研究院 | Multimodal fusion obstacle detection method and apparatus based on artificial intelligence blindness guiding |
CN113688806A (en) * | 2021-10-26 | 2021-11-23 | 南京智谱科技有限公司 | Infrared and visible light image fused multispectral target detection method and system |
CN114359776A (en) * | 2021-11-25 | 2022-04-15 | 国网安徽省电力有限公司检修分公司 | Flame detection method and device integrating light imaging and thermal imaging |
CN114359776B (en) * | 2021-11-25 | 2024-04-26 | 国网安徽省电力有限公司检修分公司 | Flame detection method and device integrating light and thermal imaging |
CN115115919B (en) * | 2022-06-24 | 2023-05-05 | 国网智能电网研究院有限公司 | Power grid equipment thermal defect identification method and device |
CN115115919A (en) * | 2022-06-24 | 2022-09-27 | 国网智能电网研究院有限公司 | Power grid equipment thermal defect identification method and device |
CN115393684A (en) * | 2022-10-27 | 2022-11-25 | 松立控股集团股份有限公司 | Anti-interference target detection method based on automatic driving scene multi-mode fusion |
CN116468928A (en) * | 2022-12-29 | 2023-07-21 | 长春理工大学 | Thermal infrared small target detection method based on visual perception correlator |
CN116468928B (en) * | 2022-12-29 | 2023-12-19 | 长春理工大学 | Thermal infrared small target detection method based on visual perception correlator |
CN116543378A (en) * | 2023-07-05 | 2023-08-04 | 杭州海康威视数字技术股份有限公司 | Image recognition method and device, electronic equipment and storage medium |
CN116543378B (en) * | 2023-07-05 | 2023-09-29 | 杭州海康威视数字技术股份有限公司 | Image recognition method and device, electronic equipment and storage medium |
CN117078920A (en) * | 2023-10-16 | 2023-11-17 | 昆明理工大学 | Infrared-visible light target detection method based on deformable attention mechanism |
CN117078920B (en) * | 2023-10-16 | 2024-01-23 | 昆明理工大学 | Infrared-visible light target detection method based on deformable attention mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN112418163B (en) | 2022-07-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112418163B (en) | Multispectral target detection blind guiding system | |
WO2021233029A1 (en) | Simultaneous localization and mapping method, device, system and storage medium | |
CN111259906B (en) | Method for generating remote sensing image target segmentation countermeasures under condition containing multilevel channel attention | |
CN109902548B (en) | Object attribute identification method and device, computing equipment and system | |
CN111582201A (en) | Lane line detection system based on geometric attention perception | |
CN110781262B (en) | Semantic map construction method based on visual SLAM | |
CN112950645B (en) | Image semantic segmentation method based on multitask deep learning | |
CN109509156B (en) | Image defogging processing method based on generation countermeasure model | |
CN114937083B (en) | Laser SLAM system and method applied to dynamic environment | |
CN113538401B (en) | Crowd counting method and system combining cross-modal information in complex scene | |
CN112767478B (en) | Appearance guidance-based six-degree-of-freedom pose estimation method | |
CN115100678A (en) | Cross-modal pedestrian re-identification method based on channel recombination and attention mechanism | |
CN113326735A (en) | Multi-mode small target detection method based on YOLOv5 | |
CN111767854B (en) | SLAM loop detection method combined with scene text semantic information | |
CN114066955A (en) | Registration method for registering infrared light image to visible light image | |
Shi et al. | An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds | |
CN112132013A (en) | Vehicle key point detection method | |
CN113112547A (en) | Robot, repositioning method thereof, positioning device and storage medium | |
CN112529011B (en) | Target detection method and related device | |
Yuan et al. | Dual attention and dual fusion: An accurate way of image-based geo-localization | |
CN116503618B (en) | Method and device for detecting remarkable target based on multi-mode and multi-stage feature aggregation | |
CN117218345A (en) | Semantic segmentation method for electric power inspection image | |
CN117152470A (en) | Space-sky non-cooperative target pose estimation method and device based on depth feature point matching | |
Gong et al. | Skipcrossnets: Adaptive skip-cross fusion for road detection | |
CN114882328B (en) | Target detection method combining visible light image and infrared image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||