CN117036890A - Training of pedestrian detection model, pedestrian detection method, device, equipment and medium

Training of pedestrian detection model, pedestrian detection method, device, equipment and medium

Info

Publication number
CN117036890A
CN117036890A
Authority
CN
China
Prior art keywords
visible light
mode
thermal infrared
image
feature
Prior art date
Legal status
Pending
Application number
CN202311062534.2A
Other languages
Chinese (zh)
Inventor
李明月
龚向锋
崔文朋
李长柏
田志仲
聂玉虎
王春冬
霍磊
张桂庆
孟颖出
于秀丽
李春晖
Current Assignee
Beijing Smartchip Microelectronics Technology Co Ltd
Original Assignee
Beijing Smartchip Microelectronics Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Smartchip Microelectronics Technology Co Ltd filed Critical Beijing Smartchip Microelectronics Technology Co Ltd
Priority to CN202311062534.2A
Publication of CN117036890A

Classifications

    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/143: Sensing or illuminating at different wavelengths
    • G06V 10/776: Validation; Performance evaluation
    • G06V 10/811: Fusion of classification results where the classifiers operate on different input data, e.g. multi-modal recognition
    • G06V 20/53: Recognition of crowd images, e.g. recognition of crowd congestion
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a training method for a pedestrian detection model, a pedestrian detection method, a device, equipment and a medium. A visible light sample image and a thermal infrared sample image for a target pedestrian scene are obtained. The features of the visible light sample image and the features of the thermal infrared sample image are added element by element to obtain multi-mode fusion image features. Feature reconstruction is performed based on the multi-mode fusion image features to obtain visible light reconstruction features and thermal infrared reconstruction features. A first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature, a second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature are determined. The initial pedestrian detection model is trained according to the similarity losses and the interaction losses until the model training stopping condition is met, thereby obtaining the target pedestrian detection model.

Description

Training of pedestrian detection model, pedestrian detection method, device, equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method, a device, equipment and a medium for training a pedestrian detection model and detecting pedestrians.
Background
A pedestrian detection model can identify and track pedestrians appearing in images or videos, improving efficiency and accuracy in applications such as traffic safety, surveillance, and search and rescue.
In the related art, a multi-modal feature fusion method (such as concatenating features of different modalities) may be used to detect pedestrians. However, mutual interference between the multi-modal features may reduce the accuracy of pedestrian detection.
Disclosure of Invention
The embodiments of the present specification aim to solve, at least to some extent, one of the technical problems in the related art. To this end, the embodiments of the present specification propose a training method for a pedestrian detection model, a pedestrian detection method, a device, equipment, and a medium.
The embodiment of the specification provides a training method of a pedestrian detection model, which comprises the following steps:
obtaining a visible light sample image and a thermal infrared sample image aiming at a target pedestrian scene;
performing feature reconstruction based on multi-mode fusion image features between the visible light sample image and the thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features; the multi-mode fusion image features are obtained by performing element-by-element addition operation based on the features of the visible light sample image and the features of the thermal infrared sample image;
determining a first single-mode similarity loss between a visible light extraction feature of the visible light sample image and the visible light reconstruction feature, a second single-mode similarity loss between a thermal infrared extraction feature of the thermal infrared sample image and the thermal infrared reconstruction feature, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature;
and performing single-mode supervision training and cross-mode supervision training on the initial pedestrian detection model simultaneously according to the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss until the model training stopping condition is met, so as to obtain a target pedestrian detection model.
In one embodiment, the initial pedestrian detection model comprises a first encoding network and a second encoding network which are connected in parallel, the first encoding network and the second encoding network are commonly connected to a fusion component, and the fusion component is connected with a first decoding network and a second decoding network which are connected in parallel; performing feature reconstruction based on the multi-mode fusion image features between the visible light sample image and the thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features includes:
inputting the visible light sample image into the first coding network for feature extraction to obtain visible light mode features;
inputting the thermal infrared sample image into the second coding network for feature extraction to obtain thermal infrared mode features;
performing element-by-element addition operation on the visible light mode characteristics and the thermal infrared mode characteristics through the fusion component to obtain the multi-mode fusion image characteristics;
and inputting the multi-mode fusion image features into the first decoding network and the second decoding network to respectively perform feature reconstruction, and correspondingly obtaining the visible light reconstruction features and the thermal infrared reconstruction features.
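Illustratively, the parallel encoder, fusion and decoder layout described above might be sketched in PyTorch as follows; the class name, channel counts and layer choices are simplified assumptions for illustration rather than the YOLOv5-based networks detailed later in this disclosure:

```python
import torch
import torch.nn as nn

class DualStreamReconstructionModel(nn.Module):
    """Sketch: parallel encoders, element-wise-add fusion, parallel decoders."""

    def __init__(self):
        super().__init__()
        # Shallow stems yield the "extraction features" used as reconstruction targets.
        self.rgb_stem = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.SiLU())
        self.th_stem = nn.Sequential(nn.Conv2d(1, 32, 3, 2, 1), nn.SiLU())
        # Deeper encoder stages yield the mode features that are fused.
        self.rgb_deep = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.SiLU())
        self.th_deep = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.SiLU())
        # Parallel decoders reconstruct each modality's extraction feature
        # from the shared fused feature.
        self.rgb_decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1))
        self.th_decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1))

    def forward(self, rgb: torch.Tensor, th: torch.Tensor):
        rgb_feat = self.rgb_stem(rgb)   # visible light extraction feature
        th_feat = self.th_stem(th)      # thermal infrared extraction feature
        # Fusion component: element-by-element addition, no learnable parameters.
        fused = self.rgb_deep(rgb_feat) + self.th_deep(th_feat)
        rgb_recon = self.rgb_decoder(fused)  # visible light reconstruction feature
        th_recon = self.th_decoder(fused)    # thermal infrared reconstruction feature
        return rgb_feat, th_feat, rgb_recon, th_recon
```

For a 640×640 input pair, this sketch yields extraction and reconstruction features of matching shape (32 channels at 320×320), which is what the similarity and interaction losses below require.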
In one embodiment, the multi-mode fusion image features are determined in the following manner:
acquiring visible light mode characteristics of the visible light sample image and thermal infrared mode characteristics of the thermal infrared sample image;
and performing element-by-element addition operation on the thermal infrared mode characteristics and the visible light mode characteristics to obtain the multi-mode fusion image characteristics.
In one embodiment, the acquiring a visible light sample image and a thermal infrared sample image for a target pedestrian scene includes:
acquiring an initial visible light image and an initial thermal infrared image which are obtained by shooting the target pedestrian scene;
preprocessing the initial visible light image and the initial thermal infrared image according to the input data size of the initial pedestrian detection model to obtain the visible light sample image and the thermal infrared sample image; and taking the normalized visible light sample image and the normalized thermal infrared sample image as input data of the initial pedestrian detection model.
In one embodiment, the visible light sample image and the thermal infrared sample image are normalized in the following manner:
normalizing the visible light sample image according to a first channel mean value and a first channel standard deviation corresponding to the visible light sample image to obtain input data corresponding to the visible light sample image;
and carrying out normalization processing on the thermal infrared sample image according to a second channel mean value and a second channel standard deviation corresponding to the thermal infrared sample image to obtain input data corresponding to the thermal infrared sample image.
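Illustratively, per-modality normalization with separate channel statistics might be implemented as in the following sketch; the mean and standard deviation values shown are placeholders, not values specified by this disclosure:

```python
import torch

def normalize(image: torch.Tensor, mean: list, std: list) -> torch.Tensor:
    """Normalize a (C, H, W) image with per-channel mean and standard deviation."""
    mean_t = torch.tensor(mean).view(-1, 1, 1)
    std_t = torch.tensor(std).view(-1, 1, 1)
    return (image - mean_t) / std_t

rgb_sample = torch.rand(3, 640, 640)  # visible light sample image (3 channels)
th_sample = torch.rand(1, 640, 640)   # thermal infrared sample image (1 channel)

# Placeholder channel statistics; the actual first/second channel mean and
# standard deviation would be computed from the respective training images.
rgb_input = normalize(rgb_sample, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
th_input = normalize(th_sample, mean=[0.45], std=[0.22])
```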
In one embodiment, model loss data is calculated using the following loss function:
L = L1 + L2 + L3 + L4
L1 = MSE(RGB_i, RGB'_i)
L2 = MSE(TH_i, TH'_i)
L3 = MSE(RGB_i, TH'_i)
L4 = MSE(TH_i, RGB'_i)
where L is the model loss data, L1 is the first single-mode similarity loss, L2 is the second single-mode similarity loss, L3 is the first multi-mode interaction loss, L4 is the second multi-mode interaction loss, RGB_i is the visible light extraction feature, RGB'_i is the visible light reconstruction feature, TH_i is the thermal infrared extraction feature, TH'_i is the thermal infrared reconstruction feature, i denotes the feature index, and MSE denotes the mean squared error loss.
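Illustratively, a direct PyTorch transcription of this loss might be the following minimal sketch; the function name and argument layout are assumptions for illustration:

```python
import torch.nn.functional as F

def model_loss(rgb_feat, th_feat, rgb_recon, th_recon):
    """L = L1 + L2 + L3 + L4 with the four MSE terms given above.

    Assumes all four tensors share one shape (e.g. 32x320x320 per sample).
    """
    l1 = F.mse_loss(rgb_recon, rgb_feat)  # first single-mode similarity loss
    l2 = F.mse_loss(th_recon, th_feat)    # second single-mode similarity loss
    l3 = F.mse_loss(th_recon, rgb_feat)   # first multi-mode interaction loss
    l4 = F.mse_loss(rgb_recon, th_feat)   # second multi-mode interaction loss
    return l1 + l2 + l3 + l4
```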
The embodiment of the present specification provides a pedestrian detection method including:
obtaining a visible light to-be-detected image and a thermal infrared to-be-detected image captured for any pedestrian scene;
and inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model obtained by the method in any of the above embodiments to perform pedestrian detection, thereby obtaining a pedestrian detection result.
The embodiment of the present specification provides a pedestrian detection method including:
obtaining a visible light to-be-detected image and a thermal infrared to-be-detected image captured for any pedestrian scene;
inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into a target pedestrian detection model for pedestrian detection to obtain a pedestrian detection result; the loss data of the target pedestrian detection model in the training process comprises single-mode loss and cross-mode loss; the single-mode loss and the cross-mode loss are used together to supervise the training of the target pedestrian detection model's capability to extract multi-mode fusion image features between the visible light to-be-detected image and the thermal infrared to-be-detected image;
The single-mode loss comprises a first single-mode similarity loss between a visible light extraction feature and a visible light reconstruction feature of a visible light sample image, and a second single-mode similarity loss between a thermal infrared extraction feature and a thermal infrared reconstruction feature of a thermal infrared sample image;
the cross-mode loss comprises a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
In one embodiment, the target pedestrian detection model comprises a first encoding network and a second encoding network connected in parallel, wherein the first encoding network and the second encoding network are commonly connected to a fusion component; inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model for pedestrian detection to obtain a pedestrian detection result comprises the following steps:
inputting the visible light to-be-detected image into the first coding network for feature extraction to obtain visible light to-be-detected features;
inputting the thermal infrared image to be detected into the second coding network for feature extraction to obtain thermal infrared features to be detected;
performing element-by-element addition operation on the visible light feature to be detected and the thermal infrared feature to be detected through the fusion component to obtain a fusion image feature to be detected;
and performing target detection based on the fusion image features to be detected to obtain the pedestrian detection result.
In one embodiment, performing target detection based on the fusion image features to be detected to obtain the pedestrian detection result comprises:
and performing convolution, pooling and activation operations on the fusion image features to be detected to perform target detection and obtain the pedestrian detection result.
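Illustratively, such a head performing convolution, pooling and activation over the fused features might look like the following sketch (64 input channels, matching the earlier architecture sketch; the output layout is an assumption, and the actual model would use the YOLOv5 detection head rather than this toy stack):

```python
import torch.nn as nn

# Hypothetical detection head over the fused to-be-detected features.
detection_head = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),  # convolution
    nn.MaxPool2d(2),                               # pooling
    nn.SiLU(),                                     # activation
    nn.Conv2d(128, 6, kernel_size=1),  # 6 = 4 box coords + objectness + 1 class (assumed)
)
```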
The embodiment of the specification provides a training device for a pedestrian detection model, the device comprising:
the sample image acquisition module is used for acquiring a visible light sample image and a thermal infrared sample image aiming at a target pedestrian scene;
the image feature reconstruction module is used for carrying out feature reconstruction based on the multi-mode fusion image features between the visible light sample image and the thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features; the multi-mode fusion image features are obtained by performing element-by-element addition operation based on the features of the visible light sample image and the features of the thermal infrared sample image;
The loss data determining module is used for determining a first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image, a second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature;
and the detection model determining module is used for simultaneously performing single-mode supervision training and cross-mode supervision training on the initial pedestrian detection model according to the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss until the model training stopping condition is met, so as to obtain a target pedestrian detection model.
The present specification embodiment provides a pedestrian detection apparatus including:
the to-be-detected image acquisition module is used for acquiring a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot for any pedestrian scene;
the detection result determining module is configured to input the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model obtained in any one of the above embodiments to perform pedestrian detection, so as to obtain a pedestrian detection result.
The present specification embodiment provides a pedestrian detection apparatus including:
the to-be-detected image acquisition module is used for acquiring a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot for any pedestrian scene;
the detection result determining module is used for inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into a target pedestrian detection model to detect pedestrians and obtain a pedestrian detection result; the loss data of the target pedestrian detection model in the training process comprises single-mode loss and cross-mode loss; the single-mode loss and the cross-mode loss are used together to supervise the training of the target pedestrian detection model's capability to extract multi-mode fusion image features between the visible light to-be-detected image and the thermal infrared to-be-detected image;
the single-mode loss comprises a first single-mode similarity loss between a visible light extraction feature and a visible light reconstruction feature of a visible light sample image, and a second single-mode similarity loss between a thermal infrared extraction feature and a thermal infrared reconstruction feature of a thermal infrared sample image;
the cross-mode loss comprises a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
The present description provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the method according to any of the above embodiments.
The present description provides a computer program product comprising instructions which, when executed by a processor of a computer device, enable the computer device to perform the steps of the method of any one of the embodiments described above.
The present description provides a chip comprising a storage unit storing a computer program and a processing unit implementing the steps of the method according to any one of the embodiments above when the processing unit executes the computer program.
In the above-described embodiments, first, a visible light sample image and a thermal infrared sample image for a target pedestrian scene are acquired; then, an element-by-element addition operation is performed on the features of the visible light sample image and the features of the thermal infrared sample image to obtain multi-mode fusion image features, and feature reconstruction is performed based on the multi-mode fusion image features to obtain visible light reconstruction features and thermal infrared reconstruction features; finally, single-mode supervision training and cross-mode supervision training are performed simultaneously on the initial pedestrian detection model according to the first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image, the second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image, the first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and the second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature, to obtain a target pedestrian detection model. On the one hand, preliminary feature fusion is achieved through the element-by-element addition operation to obtain the multi-mode fusion image features, reducing the mutual interference between the two mode features of the visible light image and the thermal infrared image; on the other hand, the model parameters are optimized and updated through the four losses, namely the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss, so that the model gains the capability of obtaining more accurate multi-mode fusion image features, the quality of the multi-mode fusion image features is improved, and the accuracy of the pedestrian detection result is further improved.
Drawings
FIG. 1a is a schematic diagram of a training initial pedestrian detection model based on the YOLOv5 framework provided in an embodiment of the present disclosure;
FIG. 1b is a schematic diagram of a target pedestrian detection model based on the YOLOv5 framework provided in an embodiment of the present disclosure;
fig. 1c is a schematic flow chart of a training method of a pedestrian detection model according to an embodiment of the present disclosure;
FIG. 2a is a schematic flow chart of obtaining a visible light reconstruction feature and a thermal infrared reconstruction feature according to an embodiment of the present disclosure;
fig. 2b is a schematic diagram of the structure of the first coding network according to the embodiment of the present disclosure;
FIG. 2c is a schematic diagram of a CBS module provided by embodiments of the present disclosure;
FIG. 2d is a schematic diagram of a CSP1_X module provided in an embodiment of the present disclosure;
FIG. 2e is a schematic diagram of a Res unit module provided in an embodiment of the present disclosure;
fig. 2f is a schematic diagram of a csp2_x module provided in an embodiment of the present disclosure;
FIG. 2g is a schematic diagram of an SPPF module provided in an embodiment of the present disclosure;
FIG. 2h is a schematic diagram of determining multi-modality fusion image features provided by embodiments of the present disclosure;
FIG. 2i is a schematic diagram of a fusion component provided in an embodiment of the present disclosure;
fig. 2j is a schematic diagram of a first decoding network according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of obtaining multi-modal fusion image features according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart of obtaining a visible light sample image and a thermal infrared sample image according to an embodiment of the present disclosure;
fig. 5 is a schematic flow chart of a pedestrian detection method according to an embodiment of the present disclosure;
fig. 6 is a flowchart of a pedestrian detection method according to an embodiment of the present disclosure;
fig. 7 is a schematic flow chart of obtaining a pedestrian detection result according to an embodiment of the present disclosure;
fig. 8 is a flowchart of a training method of a pedestrian detection model according to an embodiment of the present disclosure;
fig. 9 is a schematic diagram of a training device of a pedestrian detection model according to an embodiment of the present disclosure;
fig. 10 is a schematic view of a pedestrian detection apparatus provided in an embodiment of the present specification;
fig. 11 is a schematic diagram of a pedestrian detection apparatus provided in an embodiment of the present specification;
fig. 12 is an internal configuration diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
In the task of pedestrian detection using convolutional neural networks, the depth of the convolutional neural network has an important impact on model performance. In the process of pedestrian detection using multi-mode data, after the number of layers of the convolutional neural network is increased, the network can extract more complex feature patterns, so that better results can be obtained when the model extracts multi-mode features for pedestrian detection.
However, as the convolutional neural network extracts multi-mode features, those features can interfere with each other, causing detection accuracy to saturate or even decline. A residual network can alleviate this saturation and decline by using shortcut connections, but because the residual network performs identity mapping, redundant features may be passed on to later convolution layers, and this redundant information can interfere with and degrade pedestrian detection.
In the related art, feature fusion methods may use the fused features directly for pedestrian detection. However, the quality of the fused features in the related art may be low: because the feature fusion methods in the related art do not consider feature selectivity, the complementary features between the multi-mode data cannot be fully exploited, which affects the accuracy of pedestrian detection. Therefore, the quality of the fused features needs to be improved when multi-mode features are used for pedestrian detection.
Based on this, the present embodiment provides a training method for a pedestrian detection model. First, a visible light sample image and a thermal infrared sample image for a target pedestrian scene are obtained. Then, an element-by-element addition operation is performed on the features of the visible light sample image and the features of the thermal infrared sample image to obtain multi-mode fusion image features, and feature reconstruction is performed based on the multi-mode fusion image features to obtain visible light reconstruction features and thermal infrared reconstruction features. Finally, according to the first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image, the second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image, the first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and the second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature, single-mode supervision training and cross-mode supervision training are performed simultaneously on the initial pedestrian detection model until the model training stopping condition is met, to obtain the target pedestrian detection model.
In the embodiments of the specification, on the one hand, preliminary feature fusion is achieved through the element-by-element addition operation to obtain the multi-mode fusion image features, reducing the mutual interference between the two mode features of the visible light image and the thermal infrared image; on the other hand, the model parameters are optimized and updated through the four losses, namely the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss, so that the model gains the capability of obtaining more accurate multi-mode fusion image features, the quality of the multi-mode fusion image features is improved, and the accuracy of the pedestrian detection result is further improved.
The embodiments of the present specification provide a scenario example of the pedestrian detection model training method. An initial visible light image (RGB) and an initial Thermal infrared image (Thermal) of the application scene are acquired by a camera, and the camera uploads them to a server. The server performs image preprocessing on the initial visible light image and the initial Thermal infrared image to obtain an RGB image and a Thermal image. The server side may construct a training sample set, each training sample in the set comprising a visible light image and a thermal infrared image.
In the model training phase, the RGB image and the Thermal image are input into the initial pedestrian detection model. Referring to fig. 1a, the initial pedestrian detection model includes an encoder 106, an encoder 108, a Fusion module 110, a decoder 112, and a decoder 114. The RGB image 102 is input to the encoder 106 for feature extraction, yielding the visible light mode features of the RGB image 102; the visible light extraction features of the RGB image 102 are obtained as it passes through the convolution module of the encoder 106. The Thermal image 104 is input to the encoder 108 for feature extraction, yielding the thermal infrared mode features of the Thermal image 104; the thermal infrared extraction features are obtained as it passes through the convolution module of the encoder 108. The visible light mode features and the thermal infrared mode features are fused by pixel-by-pixel addition through the Fusion module to obtain the multi-mode fusion image features. The multi-mode fusion image features are input to the decoder 112 and the decoder 114 respectively; the decoder 112 reconstructs the RGB image data to obtain the visible light reconstruction features, and the decoder 114 reconstructs the Thermal image data to obtain the thermal infrared reconstruction features. Comparing the similarity of the visible light extraction features and the visible light reconstruction features yields the first single-mode similarity loss Loss1. Comparing the similarity of the thermal infrared extraction features and the thermal infrared reconstruction features yields the second single-mode similarity loss Loss2. Interactive supervision between the visible light extraction features and the thermal infrared reconstruction features yields the first multi-mode interaction loss Loss3. Interactive supervision between the thermal infrared extraction features and the visible light reconstruction features yields the second multi-mode interaction loss Loss4. The single-mode similarity losses and the multi-mode interaction losses are added to obtain the loss data, and the parameters of the initial pedestrian detection model are updated through the loss data until the model training stopping condition is met, giving the target pedestrian detection model. Illustratively, the parameters of the encoder 106, encoder 108, decoder 112 and decoder 114 are updated based on Loss1, Loss2, Loss3 and Loss4. It should be noted that the Fusion module 110 only performs preliminary fusion (i.e., pixel-by-pixel addition) of the visible light mode features and the thermal infrared mode features, so the Fusion module 110 requires no parameter updates.
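Illustratively, one training step combining the forward pass, the four losses and the parameter update might be sketched as follows, reusing the hypothetical DualStreamReconstructionModel and model_loss sketches given earlier:

```python
import torch

model = DualStreamReconstructionModel()  # encoders + decoders from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(rgb_batch: torch.Tensor, th_batch: torch.Tensor) -> float:
    rgb_feat, th_feat, rgb_recon, th_recon = model(rgb_batch, th_batch)
    loss = model_loss(rgb_feat, th_feat, rgb_recon, th_recon)  # Loss1+Loss2+Loss3+Loss4
    optimizer.zero_grad()
    loss.backward()  # the parameter-free Fusion add simply passes gradients through
    optimizer.step()
    return loss.item()
```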
In the model reasoning stage, the visible light image to be detected and the thermal infrared image to be detected are input into the target pedestrian detection model. Referring to fig. 1b, the target pedestrian detection model includes a parameter-updated encoder 128, a parameter-updated encoder 130, the Fusion module 110, and a CONV module (standard convolution module) 132. The visible light image 124 to be detected serves as the input of the encoder 128, which extracts the visible light feature to be detected; the thermal infrared image 126 to be detected serves as the input of the encoder 130, which extracts the thermal infrared feature to be detected. The visible light feature to be detected and the thermal infrared feature to be detected are fused through the Fusion module 110 to obtain the fusion image feature to be detected. The CONV module 132 performs convolution processing on the fusion image feature to be detected for target detection, giving the pedestrian detection result 134.
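Illustratively, the inference path might be sketched as follows, reusing the earlier hypothetical sketches (box decoding and non-maximum suppression are omitted):

```python
import torch

@torch.no_grad()
def detect(model, head, rgb_img: torch.Tensor, th_img: torch.Tensor) -> torch.Tensor:
    """Encode both modalities, fuse by element-wise addition, run the head."""
    rgb_feat = model.rgb_stem(rgb_img)
    th_feat = model.th_stem(th_img)
    fused = model.rgb_deep(rgb_feat) + model.th_deep(th_feat)  # Fusion module
    return head(fused)  # raw detection map

# e.g. raw = detect(model, detection_head,
#                   torch.rand(1, 3, 640, 640), torch.rand(1, 1, 640, 640))
```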
The embodiment of the present disclosure provides a training method for a pedestrian detection model, referring to fig. 1c, the method may include the following steps:
S110, a visible light sample image and a thermal infrared sample image for a target pedestrian scene are acquired.
The visible light sample image captures reflected light and may provide texture detail with high spatial resolution and clarity in a manner consistent with the human visual system. For example, the visible light sample image may be an RGB image with three channels containing red, green and blue visible light color information. The thermal infrared sample image distinguishes an object from its background through differences in thermal radiation, and works well in all weather and all day/night conditions. The thermal infrared sample image has only one channel, contains the intensity information of near infrared light, and may be a gray scale image. The visible light sample image and the thermal infrared sample image differ in wavelength range and imaging principle, and different sharpness and illumination conditions may produce greatly different effects on the two types of images. Infrared rays can detect the heat energy emitted by the human body, and the heat energy of a pedestrian shows distinct characteristics in the thermal infrared sample image. This allows the thermal infrared sample image to determine more accurately whether a target is a pedestrian in some cases, particularly at night or in low light environments. Therefore, the thermal infrared sample image plays a very important role in pedestrian detection and can improve the accuracy and reliability of pedestrian detection. The visible light sample image and the thermal infrared sample image may be sample images of the target pedestrian scene captured at the same time in the same scene, and may be used as a pair of samples for training the initial pedestrian detection model to obtain the target pedestrian detection model.
Specifically, the server locally stores a training sample set, from which the visible light sample image and the thermal infrared sample image for the target pedestrian scene are directly acquired. In other embodiments, an image capturing device may continuously capture the target pedestrian scene to obtain a visible light image and a thermal infrared image and send them to the server, and the server performs cropping or data enhancement processing on the visible light image and the thermal infrared image to obtain the visible light sample image and the thermal infrared sample image for the target pedestrian scene.
Illustratively, the visible light sample image may be a 640×640×3 image and the thermal infrared sample image may be a 640×640×1 image.
The image capturing device may be at least one of a video camera, a camera, an infrared camera, a fisheye camera, and the like.
S120, feature reconstruction is performed based on the multi-mode fusion image features between the visible light sample image and the thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features.
The multi-mode fusion image features are obtained by performing element-by-element addition operation based on the features of the visible light sample image and the features of the thermal infrared sample image. Feature reconstruction generally refers to the use of algorithms or models to extract important information representing different features from the raw data and then use that information to reconstruct the data.
In some cases, in a pedestrian detection task, detection using only a visible light sample image may be affected by various factors such as cloudy days, rainy days and haze. In night and low light environments, a conventional visible light image acquisition device cannot acquire clear images, making pedestrian detection through the visible light sample image difficult, whereas an infrared image acquisition device can illuminate a target with infrared rays and thus acquire bright images at night and in low light, improving the accuracy of pedestrian detection. The infrared sample image can penetrate haze, smoke, rain, snow and other weather conditions and is not influenced by ambient light. Therefore, in extreme weather, pedestrians can still be accurately detected through the infrared sample image. The infrared image and the visible light image are complementary, and multi-mode fusion image features with strong robustness and a large amount of information can be obtained by performing feature fusion based on the features of the visible light sample image and the features of the thermal infrared sample image. Adopting a multi-mode method of fusing the visible light image and the thermal infrared image therefore improves pedestrian detection.
Specifically, the features of the visible light sample image and the features of the thermal infrared sample image are subjected to element-by-element addition operation, so that the multi-mode fusion image features can be obtained. And carrying out feature reconstruction by using the multi-mode fusion image features through a decoder corresponding to the visible light mode to obtain visible light reconstruction features. And carrying out feature reconstruction by using the multi-mode fusion image features through a decoder corresponding to the thermal infrared mode to obtain thermal infrared reconstruction features.
Illustratively, the visible light reconstruction feature may be a 320×320×32 feature map and the thermal infrared reconstruction feature may be a 320×320×32 feature map.
S130, determining a first single-mode similarity loss between a visible light extraction feature and a visible light reconstruction feature of the visible light sample image, a second single-mode similarity loss between a thermal infrared extraction feature and a thermal infrared reconstruction feature of the thermal infrared sample image, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
The visible light extraction feature may be a feature of the visible light sample image extracted by a convolution operation. The thermal infrared extraction feature may be a feature of the thermal infrared sample image extracted by a convolution operation. The single-mode similarity loss is a loss function used to train models; it measures similarity between different samples within the same class and can facilitate the aggregation of similar samples in feature space. The multi-mode interaction loss is a loss function for training a multi-mode deep learning model, realizing cross-mode interaction and joint modeling by combining information from data of different modes (such as different types of images). The different modes in this embodiment may be a visible light image mode and a thermal infrared image mode.
In some cases, in a complex and variable environment, the quality of both the visible light sample image and the thermal infrared sample image varies. For example, in an environment with sufficient illumination, the visible light sample image has better practicality, while the thermal infrared sample image may not greatly improve the quality of the fused features. Considering the importance of mutual supervision between the two extracted features and the existence of complementary features in the multi-mode data, optimizing the interactive supervision loss function enables the model to extract multi-mode fusion image features for pedestrian detection more effectively. The single-mode similarity loss can optimize the model's extraction of the visible light features and the thermal infrared features.
Specifically, by selecting an appropriate single-mode similarity loss function, a first single-mode similarity loss is determined based on the visible light extraction features and the visible light reconstruction features of the visible light sample image. A second unimodal similarity loss is determined based on the thermal infrared extraction features and the thermal infrared reconstruction features of the thermal infrared sample image. By selecting an appropriate multi-modal interaction loss function, a first multi-modal interaction loss is determined based on the thermal infrared reconstruction features and the visible light extraction features, and a second multi-modal interaction loss is determined based on the visible light reconstruction features and the thermal infrared extraction features.
In some embodiments, the feature extraction is performed on the visible light sample image by a convolution module corresponding to the first coding network, so as to obtain a visible light extraction feature of the visible light sample image. And performing feature extraction on the thermal infrared sample image through a convolution module corresponding to the second coding network to obtain thermal infrared extraction features of the thermal infrared sample image. The size of the visible light extraction features may be 320×320×32 and the size of the thermal infrared extraction features may be 320×320×32.
S140, single-mode supervision training and cross-mode supervision training are performed simultaneously on the initial pedestrian detection model according to the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss until the model training stopping condition is met, to obtain a target pedestrian detection model.
Single-mode supervised training is a machine learning method that uses labeled data from a single data source for model training. The method is generally applied to only one type of input data (e.g., one type of image), and various patterns and features can be identified and classified by a supervised learning algorithm. Cross-mode supervised training refers to using label information from one modality (e.g., one type of image) to assist learning in another modality (e.g., another type of image). In this way, model performance may be improved in the absence of sufficient labeled data, and the model can learn more comprehensive features from multiple modalities.
Under some conditions, the initial pedestrian detection model is trained through multi-modal interaction loss, so that multi-modal fusion image features with higher quality can be obtained, the multi-modal fusion image features with improved quality are used for pedestrian detection, and the accuracy of pedestrian detection can be improved. The quality of the visible light extraction features and the thermal infrared extraction features can be improved by single mode similarity loss. Further, the quality of the multi-mode fusion image features can be improved by improving the quality of the visible light extraction features and the thermal infrared extraction features.
Specifically, the parameters of the initial pedestrian detection model are updated based on the first single-mode similarity loss and the second single-mode similarity loss to realize single-mode supervision training, and based on the first multi-mode interaction loss and the second multi-mode interaction loss to realize cross-mode supervision training. Training then continues on the updated initial pedestrian detection model in the same manner, and when the model training stopping condition is reached, the target pedestrian detection model is obtained. The model training stopping condition may be that the model loss data tends to converge, or that the number of training rounds reaches a preset number.
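Illustratively, the two stopping conditions (loss convergence or a preset number of rounds) might be checked as in the following sketch, where loader is an assumed iterable yielding (visible light, thermal infrared) batch pairs and train_step is the earlier sketch:

```python
def train(loader, max_epochs: int = 100, eps: float = 1e-4) -> None:
    """Run train_step until the loss converges or a preset round count is hit."""
    prev_loss = float("inf")
    for epoch in range(max_epochs):  # preset number of training rounds
        epoch_loss = sum(train_step(rgb, th) for rgb, th in loader) / len(loader)
        if abs(prev_loss - epoch_loss) < eps:  # loss data tends to converge
            break
        prev_loss = epoch_loss
```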
In the above embodiment, first, a visible light sample image and a thermal infrared sample image for a target pedestrian scene are acquired; then, an element-by-element addition operation is performed on the features of the visible light sample image and the features of the thermal infrared sample image to obtain multi-mode fusion image features, and feature reconstruction is performed based on the multi-mode fusion image features to obtain visible light reconstruction features and thermal infrared reconstruction features; finally, single-mode supervision training and cross-mode supervision training are performed simultaneously on the initial pedestrian detection model according to the first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image, the second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image, the first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and the second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature, to obtain a target pedestrian detection model. On the one hand, preliminary feature fusion is achieved through the element-by-element addition operation to obtain the multi-mode fusion image features, reducing the mutual interference between the two mode features of the visible light image and the thermal infrared image; on the other hand, the model parameters are optimized and updated through the four losses, namely the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss, so that the model gains the capability of obtaining more accurate multi-mode fusion image features, the quality of the multi-mode fusion image features is improved, and the accuracy of the pedestrian detection result is further improved.
In some embodiments, referring to fig. 2a, the initial pedestrian detection model includes a first encoding network and a second encoding network in parallel, the first encoding network and the second encoding network being commonly connected to a fusion component, the fusion component being connected with a first decoding network and a second decoding network in parallel. Performing feature reconstruction based on multi-mode fusion image features between a visible light sample image and a thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features, wherein the method can comprise the following steps:
S210, inputting the visible light sample image into a first coding network for feature extraction to obtain visible light mode features.
S220, inputting the thermal infrared sample image into a second coding network for feature extraction to obtain thermal infrared mode features.
Wherein the encoding network is a deep learning neural network for converting high-dimensional input data (such as images, audio or text) into a low-dimensional representation for more efficient analysis and processing.
In some cases, the first encoding network and the second encoding network in parallel may be a two-way feature extraction network, which is a neural network structure in which there are two input paths, each path containing the same or different feature extraction layers, and the outputs of the two paths are eventually added pixel-by-pixel to form the final network output. The structure can effectively preliminarily integrate different types of features, reduce interference of features among different modes and improve performance and robustness of the model.
Specifically, the visible light sample image is input into a first coding network to perform feature extraction, and features of a visible light sample image mode can be extracted through convolution operation in the first coding network to obtain visible light mode features. And inputting the thermal infrared sample image into a second coding network for feature extraction, and extracting the features of the thermal infrared sample image mode through convolution operation in the second coding network to obtain thermal infrared mode features.
The initial pedestrian detection model may be, for example, a cross-modal supervision (CMS) model for visible light sample images and thermal infrared sample images built on the YOLOv5 framework. Referring to fig. 2b, fig. 2b shows the first coding network structure: the first coding network 210 is composed of CBS modules (Conv-BN-SiLU modules), CSP1_X modules (Cross Stage Partial network modules), CSP2_X modules, and an SPPF module (fast spatial pyramid pooling module). Referring to fig. 2c, the CBS module 220 is composed of a Conv module (standard convolution module), a BN module (Batch Normalization module), and a SiLU module (Sigmoid-Weighted Linear Unit); the SiLU activation function is a variant of the Swish activation function, given by SiLU(x) = x·sigmoid(x). Referring to fig. 2d, the CSP1_X module 230 is composed of CBS modules, Residual Units, a Concat module (concatenation module), a BN module, and a SiLU module. Referring to fig. 2e, the Res unit module 240 is composed of CBS modules and an Add module (element-wise addition). Referring to fig. 2f, the CSP2_X module 250 is composed of CBS modules, a Concat module, a BN module, and a SiLU module. Referring to fig. 2g, the SPPF module 260 is composed of CBS modules, MaxPool modules, and a Concat module. The structural composition of the second coding network is identical to that of the first coding network.
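Illustratively, a CBS block and its SiLU activation might be sketched as follows (a minimal PyTorch sketch, not the exact YOLOv5 implementation):

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, with SiLU(x) = x * sigmoid(x)."""

    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()  # identical to x * torch.sigmoid(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))
```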
For example, a visible light sample image with a size of 640×640×3 pixels may be input into the first coding network for feature extraction to obtain visible light mode features with a size of 16×16×512. A thermal infrared sample image with a size of 640×640×1 pixels may be input into the second coding network for feature extraction to obtain thermal infrared mode features with a size of 16×16×512.
S230, carrying out element-by-element addition operation on the visible light mode characteristics and the thermal infrared mode characteristics through the fusion component to obtain multi-mode fusion image characteristics.
In some cases, preliminarily fusing the visible light mode features and the thermal infrared mode features can improve the prediction performance of the pedestrian detection model; fusing features from different sources can also reduce the possibility of overfitting and improve the generalization capability of the pedestrian detection model. In this embodiment, the multi-mode fusion image features are simple, preliminary multi-mode addition fusion features, that is, the multi-mode fusion image features are obtained by adding the visible light mode features and the thermal infrared mode features element by element.
Specifically, the fusion component adds each element of the visible light mode feature to the element at the same position of the thermal infrared mode feature, thereby fusing the visible light mode feature and the thermal infrared mode feature to obtain a multi-mode fusion image feature U.
For example, referring to FIG. 2h, the visible light mode feature 270 may include four elements A_1, A_2, A_3 and A_4, whose element values are 35, 67, 24 and 48, respectively. The thermal infrared mode feature 280 may include four elements B_1, B_2, B_3 and B_4, whose element values are 15, 29, 7 and 36, respectively. Adding the element value 35 of A_1 to the element value 15 of B_1 gives the element value 50 of element C_1 of the multi-mode fusion image feature 290. Adding the element value 67 of A_2 to the element value 29 of B_2 gives the element value 96 of element C_2. Adding the element value 24 of A_3 to the element value 7 of B_3 gives the element value 31 of element C_3. Adding the element value 48 of A_4 to the element value 36 of B_4 gives the element value 84 of element C_4.
In some embodiments, referring to fig. 2i, the Fusion component may be a Fusion module. The visible light mode feature 202 and the thermal infrared mode feature 204 are input to the Fusion module 206, and the Fusion module 206 can perform pixel-by-pixel addition operation on the visible light mode feature and the thermal infrared mode feature to obtain a multi-mode Fusion image feature 208.
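A minimal sketch of such a Fusion component, assuming plain element-wise tensor addition, is shown below; the toy check at the end reproduces the element values of FIG. 2h.

```python
import torch
import torch.nn as nn

class Fusion(nn.Module):
    """Fusion component: element-wise (pixel-by-pixel) addition of the
    visible light mode feature and the thermal infrared mode feature."""
    def forward(self, rgb_feat, th_feat):
        assert rgb_feat.shape == th_feat.shape, "mode features must align"
        return rgb_feat + th_feat

# Toy check reproducing the element values of FIG. 2h:
rgb = torch.tensor([35.0, 67.0, 24.0, 48.0])
th = torch.tensor([15.0, 29.0, 7.0, 36.0])
print(Fusion()(rgb, th))  # tensor([50., 96., 31., 84.])
```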
S240, inputting the multi-mode fusion image features into a first decoding network and a second decoding network to respectively reconstruct the features, and correspondingly obtaining visible light reconstruction features and thermal infrared reconstruction features.
In some cases, the feature reconstruction of the multi-mode fusion image features is realized through the first decoding network and the second decoding network, so that the difference between the visible light mode features and the thermal infrared mode features can be accurately identified, and the quality of feature extraction is improved.
Specifically, the multi-mode fusion image characteristics obtained through pixel-by-pixel addition operation are input into a first decoding network, and up-sampling, convolution and other operations are carried out through the first decoding network, so that visible light reconstruction characteristics are obtained. And inputting the multi-mode fusion image characteristics obtained by the pixel-by-pixel addition operation into a second decoding network, and performing operations such as up-sampling, convolution and the like through the second decoding network to obtain thermal infrared reconstruction characteristics.
For example, referring to fig. 2j, the first decoding network 212 may be composed of a CBS module and an upsampling module. The composition of the second decoding network structure is the same as the composition of the first decoding network structure. And inputting the multi-mode fusion image characteristics obtained by the pixel-by-pixel addition operation into a first decoding network to obtain visible light reconstruction characteristics. And inputting the multi-mode fusion image characteristics obtained by the pixel-by-pixel addition operation into a second decoding network to obtain thermal infrared reconstruction characteristics.
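Under the same assumptions, a decoding network of this kind (CBS blocks interleaved with upsampling) might be sketched as follows, reusing the CBS class from the earlier sketch; the channel widths are illustrative, not the patent's exact configuration.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Decoding network sketch: CBS blocks interleaved with 2x nearest
    upsampling, expanding the 16x16x512 fused feature spatially."""
    def __init__(self, in_ch=512, out_ch=64):
        super().__init__()
        self.layers = nn.Sequential(
            CBS(in_ch, 256),  # CBS class from the earlier sketch
            nn.Upsample(scale_factor=2, mode="nearest"),
            CBS(256, 128),
            nn.Upsample(scale_factor=2, mode="nearest"),
            CBS(128, out_ch),
        )

    def forward(self, fused):
        return self.layers(fused)
```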
In the above embodiment, the single-mode similarity loss and the multi-mode interaction loss of the initial pedestrian detection model can be determined by determining the visible light mode feature, the thermal infrared mode feature, the visible light reconstruction feature and the thermal infrared reconstruction feature, so that the parameters extracted from the features of the initial pedestrian detection model are optimized, and the quality of feature extraction is improved.
In some embodiments, referring to fig. 3, the determining method of the multi-mode fusion image features may include the following steps:
S310, obtaining visible light mode characteristics of a visible light sample image and thermal infrared mode characteristics of a thermal infrared sample image.
And S320, performing element-by-element addition operation on the thermal infrared mode characteristics and the visible light mode characteristics to obtain multi-mode fusion image characteristics.
Specifically, the visible light mode characteristics are obtained by extracting the characteristics of the visible light sample image. And obtaining thermal infrared mode characteristics by extracting the characteristics of the thermal infrared sample image. Then, the element at the visible light mode characteristic position and the element at the same position of the thermal infrared mode characteristic can be added, so that the element-by-element addition operation between the visible light mode characteristic and the thermal infrared mode characteristic can be realized, and the multi-mode fusion image characteristic can be obtained.
In the above embodiment, the visible light mode feature of the visible light sample image and the thermal infrared mode feature of the thermal infrared sample image are obtained, and the element-by-element addition operation is performed on the thermal infrared mode feature and the visible light mode feature to obtain the multi-mode fusion image feature. Fusing the visible light mode feature and the thermal infrared mode feature can improve the prediction performance of the pedestrian detection model, and fusing features from different sources can reduce the possibility of overfitting and improve the generalization capability of the pedestrian detection model.
In some embodiments, referring to fig. 4, the training method of the pedestrian detection model may further include the following steps:
S410, acquiring an initial visible light image and an initial thermal infrared image which are obtained by shooting a target pedestrian scene.
S420, preprocessing an initial visible light image and an initial thermal infrared image according to the input data size of the initial pedestrian detection model to obtain a visible light sample image and a thermal infrared sample image.
And taking the normalized visible light sample image and the normalized thermal infrared sample image as input data of an initial pedestrian detection model.
In some cases, the pixel data of the initial visible light image and the initial thermal infrared image are normalized to fall within a specified range, typically between 0 and 1. The dimension difference among different variables can be eliminated through normalization processing, so that comparison and analysis can be better performed, and the consumption of calculation resources can be saved.
Specifically, an image acquisition device is used for shooting a target pedestrian scene in a real application scene, so that an initial visible light image and an initial thermal infrared image can be obtained. The initial visible light image and the initial thermal infrared image come in pairs, and each target pedestrian scene may have at least one such pair. According to the input data size of the initial pedestrian detection model, the initial visible light image and the initial thermal infrared image are subjected to image preprocessing (such as clipping or equal-proportion scaling), so that the visible light sample image and the thermal infrared sample image for the target pedestrian scene can be obtained.
Illustratively, the input data size of the initial pedestrian detection model may be 640×640. The initial visible light image and the initial thermal infrared image can be preprocessed by equal-proportion scaling or clipping, converting their image sizes to 640×640.
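A hedged illustration of this preprocessing step with OpenCV is shown below; a plain resize is used for brevity, though equal-proportion scaling plus padding or clipping would equally satisfy the description.

```python
import cv2

def preprocess(image, size=(640, 640)):
    """Resize an initial image to the model's input data size.
    A production pipeline might instead letterbox (scale and pad)
    to preserve the aspect ratio."""
    return cv2.resize(image, size, interpolation=cv2.INTER_LINEAR)
```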
In the above embodiment, an initial visible light image and an initial thermal infrared image obtained by shooting a target pedestrian scene are obtained, and the initial visible light image and the initial thermal infrared image are preprocessed according to the input data size of the initial pedestrian detection model, so as to obtain a visible light sample image and a thermal infrared sample image. By preprocessing the initial visible light image and the initial thermal infrared image, input data can be provided for subsequent feature extraction.
In some embodiments, the visible light sample image and the thermal infrared sample image are normalized in the following manner:
and carrying out normalization processing on the visible light sample image according to the first channel mean value and the first channel standard deviation corresponding to the visible light sample image to obtain input data corresponding to the visible light sample image.
And carrying out normalization processing on the thermal infrared sample image according to the second channel mean value and the second channel standard deviation corresponding to the thermal infrared sample image to obtain input data corresponding to the thermal infrared sample image.
The channel mean may be the sum of the pixel values of all pixels on any channel divided by the number of pixels. The channel standard deviation measures the degree of dispersion of the pixels in that channel, i.e. the deviation of each pixel value from the channel mean, and is the square root of the variance. The larger the standard deviation, the more dispersed the pixel values, and vice versa.
Specifically, the visible light sample image has an R channel, a G channel and a B channel. For any one of the three RGB channels, the first channel mean of that channel is subtracted from the pixel data on that channel, and the result is divided by the first channel standard deviation of that channel, so that input data corresponding to the visible light sample image on that channel can be obtained. For the thermal infrared sample image, the second channel mean is subtracted from its pixel data and the result is divided by the second channel standard deviation, so that input data corresponding to the thermal infrared sample image can be obtained.
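A minimal sketch of this per-channel normalization follows; the channel statistics and the random test images are hypothetical placeholders, since the real values would be computed from the training set.

```python
import numpy as np

def normalize(image, channel_mean, channel_std):
    """Subtract the per-channel mean, then divide by the per-channel
    standard deviation. `image` is HxWxC with values in [0, 255]."""
    image = image.astype(np.float32) / 255.0  # scale to [0, 1] first
    return (image - channel_mean) / channel_std

# Hypothetical channel statistics (ImageNet-style numbers, for
# illustration only) applied to random stand-in images:
rgb_image = np.random.randint(0, 256, (640, 640, 3), dtype=np.uint8)
th_image = np.random.randint(0, 256, (640, 640, 1), dtype=np.uint8)
rgb_input = normalize(rgb_image, np.array([0.485, 0.456, 0.406]),
                      np.array([0.229, 0.224, 0.225]))
th_input = normalize(th_image, np.array([0.45]), np.array([0.22]))
```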
In the embodiment, through normalization processing, the dimensional difference between different variables can be eliminated, so that comparison and analysis can be better performed, and the consumption of calculation resources can be saved.
In some embodiments, model loss data is calculated using the following loss function:
L = L1 + L2 + L3 + L4
L1 = MSE(RGB_i, RGB′_i)
L2 = MSE(TH_i, TH′_i)
L3 = MSE(RGB_i, TH′_i)
L4 = MSE(TH_i, RGB′_i)
wherein L is the model loss data, L1 is the first single-mode similarity loss, L2 is the second single-mode similarity loss, L3 is the first multi-mode interaction loss, L4 is the second multi-mode interaction loss, RGB_i is the visible light extraction feature, RGB′_i is the visible light reconstruction feature, TH_i is the thermal infrared extraction feature, TH′_i is the thermal infrared reconstruction feature, i denotes the feature index, and MSE denotes the mean square error loss.
The mean square error loss is the mean value of the sum of squares of the errors of the corresponding points of the predicted data and the original data.
Specifically, similarity comparison is performed on the visible light reconstruction feature and the visible light extraction feature, the parameters of the initial pedestrian detection model are iterated by using a mean square error loss function MSE, loss between the visible light reconstruction feature and the visible light extraction feature is minimized, and more effective fusion features can be obtained. The first single mode similarity loss equation between the visible light reconstruction feature and the visible light extraction feature is as follows:
L1 = MSE(RGB_i, RGB′_i)
and (3) comparing the similarity of the thermal infrared reconstruction feature and the thermal infrared extraction feature, iterating parameters of the initial pedestrian detection model by using a mean square error loss function MSE, and minimizing the loss between the thermal infrared reconstruction feature and the thermal infrared extraction feature, so that more effective fusion features can be obtained. The second single-mode similarity loss equation between the thermal infrared reconstruction feature and the thermal infrared extraction feature is as follows:
L2 = MSE(TH_i, TH′_i)
And carrying out interactive supervision on the visible light extraction characteristics and the thermal infrared reconstruction characteristics, and carrying out loss calculation on the two mode data characteristics. By using the mean square error loss function MSE, the parameters of the initial pedestrian detection model are iterated, the loss between the visible light extraction feature and the thermal infrared reconstruction feature is minimized, and more effective fusion features can be obtained. The first multi-modal interaction loss equation between the visible light extraction feature and the thermal infrared reconstruction feature is as follows:
L3 = MSE(RGB_i, TH′_i)
and carrying out interactive supervision on the thermal infrared extraction characteristics and the visible light reconstruction characteristics, and carrying out loss calculation on the two mode data characteristics. By using the mean square error loss function MSE, the parameters of the initial pedestrian detection model are iterated, the loss between the thermal infrared extraction feature and the visible light reconstruction feature is minimized, and more effective fusion features can be obtained. The second multi-modal interaction loss equation between the thermal infrared extraction feature and the visible light reconstruction feature is as follows:
L4 = MSE(TH_i, RGB′_i)
and adding the single-mode similarity loss and the multi-mode interaction loss to obtain model loss data, and iterating and optimizing the loss to obtain a target pedestrian detection model. The model loss data is formulated as follows:
L = L1 + L2 + L3 + L4
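Assuming the four features are PyTorch tensors of identical shape, the combined loss above can be sketched directly from the formulas; the function name model_loss and the argument names are illustrative.

```python
import torch.nn.functional as F

def model_loss(rgb_feat, rgb_recon, th_feat, th_recon):
    """Combined loss L = L1 + L2 + L3 + L4 from the formulas above."""
    l1 = F.mse_loss(rgb_recon, rgb_feat)  # first single-mode similarity loss
    l2 = F.mse_loss(th_recon, th_feat)    # second single-mode similarity loss
    l3 = F.mse_loss(th_recon, rgb_feat)   # first multi-mode interaction loss
    l4 = F.mse_loss(rgb_recon, th_feat)   # second multi-mode interaction loss
    return l1 + l2 + l3 + l4
```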
in the above embodiment, the visible light extraction feature and the thermal infrared extraction feature may be optimized by determining a first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image, and a second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image. By determining the first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature and the second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature, the quality of the fusion feature can be improved, and therefore the accuracy of pedestrian detection results is improved.
The embodiment of the present disclosure provides a pedestrian detection method, referring to fig. 5, the method may include the following steps:
S510, obtaining a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot for any pedestrian scene.
S520, inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model obtained in any one of the above embodiments to detect pedestrians, thereby obtaining a pedestrian detection result.
Specifically, any pedestrian scene is shot through the image acquisition equipment, so that an initial visible light to-be-detected image and an initial thermal infrared to-be-detected image can be obtained. Inputting the initial visible light to-be-detected image and the initial thermal infrared to-be-detected image into a target pedestrian detection model, and realizing pedestrian detection through the target pedestrian detection model to obtain a pedestrian detection result.
In the above embodiment, the visible light to-be-detected image and the thermal infrared to-be-detected image obtained by shooting any pedestrian scene are obtained and input into the target pedestrian detection model obtained in any one of the above embodiments to perform pedestrian detection, so as to obtain a pedestrian detection result. The pedestrian detection can be applied to the aspects of vehicle auxiliary driving systems, intelligent video monitoring, robots, aerial images, man-machine interaction systems, motion analysis and the like.
The embodiment of the present disclosure provides a pedestrian detection method, referring to fig. 6, the method may include the following steps:
S610, obtaining a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot for any pedestrian scene.
S620, inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into a target pedestrian detection model to detect pedestrians, and obtaining a pedestrian detection result.
The loss data of the target pedestrian detection model in the training process comprises single-mode loss and cross-mode loss. The single-mode loss and the cross-mode loss are used to supervise training of the target pedestrian detection model's capability to extract multi-mode fusion image features between the visible light to-be-detected image and the thermal infrared to-be-detected image. The single-mode loss includes a first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image and a second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image. The cross-modal loss includes a first multi-modal interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-modal interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
Specifically, any pedestrian scene is shot through the image acquisition equipment, so that an initial visible light to-be-detected image and an initial thermal infrared to-be-detected image can be obtained. And carrying out image preprocessing on the initial visible light to-be-detected image and the initial thermal infrared to-be-detected image to obtain the visible light to-be-detected image and the thermal infrared to-be-detected image which have the same size as the input data of the target pedestrian detection model. And inputting the visible light to-be-detected image and the thermal infrared to-be-detected image subjected to image pretreatment into a target pedestrian detection model, and realizing pedestrian detection through the target pedestrian detection model to obtain a pedestrian detection result.
In the above embodiment, the visible light to-be-detected image and the thermal infrared to-be-detected image obtained by shooting any pedestrian scene are obtained and input into the target pedestrian detection model to perform pedestrian detection, so as to obtain a pedestrian detection result. The pedestrian detection can be applied to the aspects of vehicle auxiliary driving systems, intelligent video monitoring, robots, aerial images, man-machine interaction systems, motion analysis and the like.
In some embodiments, referring to fig. 7, the target pedestrian detection model includes a first encoding network and a second encoding network in parallel, the first encoding network and the second encoding network being commonly connected to the fusion component. Inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model to detect pedestrians and obtain a pedestrian detection result may comprise the following steps:
And S710, inputting the visible light to-be-detected image into a first coding network for feature extraction to obtain visible light to-be-detected features.
S720, inputting the thermal infrared to-be-detected image into a second coding network for feature extraction to obtain the thermal infrared to-be-detected feature.
And S730, performing element-by-element addition operation on the visible light to-be-detected feature and the thermal infrared to-be-detected feature through the fusion component to obtain to-be-detected fusion image features.
And S740, performing target detection based on the fusion image characteristics to be detected to obtain a pedestrian detection result.
Specifically, the visible light to-be-detected image is input into a first coding network included in the target pedestrian detection model to perform feature extraction to obtain the visible light to-be-detected feature, and the thermal infrared to-be-detected image is input into a second coding network included in the target pedestrian detection model to perform feature extraction to obtain the thermal infrared to-be-detected feature. And carrying out element-by-element addition operation on the visible light feature to be detected and the thermal infrared feature to be detected through a fusion component included in the target pedestrian detection model, so as to obtain the feature of the fusion image to be detected. And carrying out target detection on the fusion image characteristics to be detected to obtain a pedestrian detection result.
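As a hedged illustration of this inference flow (steps S710 to S740), the sketch below assumes the trained model exposes encoder_rgb, encoder_th, fusion and head submodules; these attribute names are hypothetical, not taken from the patent.

```python
import torch

def detect_pedestrians(model, rgb_img, th_img):
    """Inference sketch: extract features from each modality, fuse them
    element-wise, then run target detection on the fused feature."""
    with torch.no_grad():
        rgb_feat = model.encoder_rgb(rgb_img)    # visible light feature
        th_feat = model.encoder_th(th_img)       # thermal infrared feature
        fused = model.fusion(rgb_feat, th_feat)  # element-wise addition
        return model.head(fused)                 # pedestrian detection result
```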
In the above embodiment, the visible light to-be-detected image is input into the first coding network to perform feature extraction to obtain the visible light to-be-detected feature, the thermal infrared to-be-detected image is input into the second coding network to perform feature extraction to obtain the thermal infrared to-be-detected feature, the element-by-element addition operation is performed on the visible light to-be-detected feature and the thermal infrared to-be-detected feature through the fusion component to obtain the to-be-detected fusion image feature, and the target detection is performed based on the to-be-detected fusion image feature to obtain the pedestrian detection result. The pedestrian detection can be applied to the aspects of vehicle auxiliary driving systems, intelligent video monitoring, robots, aerial images, man-machine interaction systems, motion analysis and the like.
In some embodiments, pedestrian detection is performed based on the feature of the fusion image to be detected, and a pedestrian detection result is obtained, which may include: and carrying out convolution, pooling and activation processing operations according to the characteristics of the fusion image to be detected so as to carry out target detection and obtain a pedestrian detection result.
Specifically, the fusion image features to be detected are input into a convolution layer for convolution processing to obtain a convolution processing result. The convolution processing result is input into a pooling layer for pooling processing to obtain a pooling processing result. The pooling processing result is then activated through an activation layer to perform target detection, so as to obtain the pedestrian detection result. The activation processing may be implemented by a sigmoid function, for example.
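A minimal sketch of such a convolution / pooling / activation step is shown below; it only mirrors the operations named above and is not the model's actual YOLOv5 detection head.

```python
import torch.nn as nn

class DetectHead(nn.Module):
    """Toy detection step: convolution, max pooling, then a sigmoid
    activation producing per-cell pedestrian confidence scores."""
    def __init__(self, in_ch=512):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=2)
        self.act = nn.Sigmoid()

    def forward(self, fused):
        return self.act(self.pool(self.conv(fused)))
```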
In the above embodiment, convolution, pooling and activation processing are performed according to the feature of the fusion image to be detected, so as to perform target detection, and obtain a pedestrian detection result. The pedestrian detection can be applied to the aspects of vehicle auxiliary driving systems, intelligent video monitoring, robots, aerial images, man-machine interaction systems, motion analysis and the like.
The embodiment of the specification also provides a training method of the pedestrian detection model, wherein the initial pedestrian detection model comprises a first coding network and a second coding network which are connected in parallel, the first coding network and the second coding network are connected to a fusion component together, and the fusion component is connected with a first decoding network and a second decoding network which are connected in parallel. For example, referring to fig. 8, the training method of the pedestrian detection model may include the steps of:
S802, acquiring an initial visible light image and an initial thermal infrared image which are obtained by shooting a target pedestrian scene.
S804, preprocessing the initial visible light image and the initial thermal infrared image according to the input data size of the initial pedestrian detection model to obtain a visible light sample image and a thermal infrared sample image.
S806, inputting the visible light sample image into a first coding network for feature extraction to obtain visible light mode features.
S808, inputting the thermal infrared sample image into a second coding network for feature extraction to obtain thermal infrared mode features.
And S810, performing element-by-element addition operation on the visible light mode characteristics and the thermal infrared mode characteristics through a fusion component to obtain multi-mode fusion image characteristics.
And S812, inputting the multi-mode fusion image features into a first decoding network and a second decoding network to respectively reconstruct the features, and correspondingly obtaining visible light reconstruction features and thermal infrared reconstruction features.
S814, determining a first single-mode similarity loss between a visible light extraction feature and a visible light reconstruction feature of the visible light sample image, a second single-mode similarity loss between a thermal infrared extraction feature and a thermal infrared reconstruction feature of the thermal infrared sample image, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
Specifically, model loss data is calculated using the following loss function:
L = L1 + L2 + L3 + L4
L1 = MSE(RGB_i, RGB′_i)
L2 = MSE(TH_i, TH′_i)
L3 = MSE(RGB_i, TH′_i)
L4 = MSE(TH_i, RGB′_i)
wherein L is the model loss data, L1 is the first single-mode similarity loss, L2 is the second single-mode similarity loss, L3 is the first multi-mode interaction loss, L4 is the second multi-mode interaction loss, RGB_i is the visible light mode feature, RGB′_i is the visible light reconstruction feature, TH_i is the thermal infrared mode feature, TH′_i is the thermal infrared reconstruction feature, i denotes the feature index, and MSE denotes the mean square error loss.
And S816, performing single-mode supervision training and cross-mode supervision training on the initial pedestrian detection model simultaneously according to the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss until the model stopping training condition is met, and obtaining a target pedestrian detection model.
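Putting the pieces together, a minimal training-loop sketch for steps S802 to S816 might look as follows; `model` is assumed to return the four features consumed by the model_loss sketch above, and a fixed epoch budget stands in for the model stop-training condition.

```python
import torch

def train(model, loader, epochs=100, lr=1e-3):
    """Single-mode and cross-mode supervised training sketch: forward the
    preprocessed, normalized sample pairs, fuse and reconstruct inside the
    model, then optimize the combined loss L = L1 + L2 + L3 + L4."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for rgb, th in loader:  # visible light / thermal infrared pairs
            rgb_feat, th_feat, rgb_recon, th_recon = model(rgb, th)
            loss = model_loss(rgb_feat, rgb_recon, th_feat, th_recon)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```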
Referring to fig. 9, a training apparatus 900 for a pedestrian detection model is provided in an embodiment of the present disclosure, the training apparatus 900 for a pedestrian detection model includes: a sample image acquisition module 910, an image feature reconstruction module 920, a loss data determination module 930, and a detection model determination module 940.
A sample image acquisition module 910 for acquiring a visible light sample image and a thermal infrared sample image for a target pedestrian scene;
The image feature reconstruction module 920 is configured to perform feature reconstruction based on the multi-mode fusion image feature between the visible light sample image and the thermal infrared sample image, so as to obtain a visible light reconstruction feature and a thermal infrared reconstruction feature; the multi-mode fusion image features are obtained by performing element-by-element addition operation based on the features of the visible light sample image and the features of the thermal infrared sample image;
a loss data determining module 930, configured to determine a first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image, a second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature;
and the detection model determining module 940 is configured to perform single-mode supervised training and cross-mode supervised training on the initial pedestrian detection model simultaneously according to the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss until a model stopping training condition is met, so as to obtain a target pedestrian detection model.
Referring to fig. 10, the embodiment of the present disclosure provides a pedestrian detection device 1000, and the pedestrian detection device 1000 includes: the device comprises a to-be-detected image acquisition module 1010 and a detection result determination module 1020.
The to-be-detected image obtaining module 1010 is configured to obtain a visible light to-be-detected image and a thermal infrared to-be-detected image that are obtained by shooting for any pedestrian scene;
the detection result determining module 1020 is configured to input the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model obtained in any one of the above embodiments to perform pedestrian detection, thereby obtaining a pedestrian detection result.
Referring to fig. 11, the embodiment of the present disclosure provides a pedestrian detection apparatus 1100, the pedestrian detection apparatus 1100 includes: the device comprises an image acquisition module 1110 to be detected, a detection result determining module 1120, a similarity loss determining module 1130 and an interaction loss determining module 1140.
The to-be-detected image obtaining module 1110 is configured to obtain a visible light to-be-detected image and a thermal infrared to-be-detected image that are obtained by shooting for any pedestrian scene;
the detection result determining module 1120 is configured to input the visible light to-be-detected image and the thermal infrared to-be-detected image into a target pedestrian detection model to perform pedestrian detection, so as to obtain a pedestrian detection result; the loss data of the target pedestrian detection model in the training process comprises single-mode loss and cross-mode loss; the single-mode loss and the cross-mode loss are used together to supervise training of the target pedestrian detection model's capability to extract multi-mode fusion image characteristics between the visible light to-be-detected image and the thermal infrared to-be-detected image;
The single-mode loss comprises a first single-mode similar loss between a visible light extraction feature and a visible light reconstruction feature of the visible light sample image and a second single-mode similar loss between a thermal infrared extraction feature and a thermal infrared reconstruction feature of the thermal infrared sample image;
the cross-modal loss includes a first multi-modal interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, a second multi-modal interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
For specific description of the training device of the pedestrian detection model and the pedestrian detection device, reference may be made to the description of the training method of the pedestrian detection model and the pedestrian detection method hereinabove, and the description thereof will not be repeated here.
In some embodiments, a computer device is provided, which may be a terminal, and an internal structure diagram thereof may be as shown in fig. 12. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a training method for a pedestrian detection model. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 12 is merely a block diagram of a portion of the structure associated with the aspects disclosed herein and is not limiting of the computer device to which the aspects disclosed herein apply, and in particular, the computer device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
The present description provides a chip comprising a memory unit storing a computer program and a processing unit implementing the steps of the method of any of the above embodiments when the computer program is executed by the processing unit.
In some embodiments, a computer device is provided, comprising a memory in which a computer program is stored, and a processor which, when executing the computer program, carries out the method steps of the above embodiments.
The present description embodiment provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method of any of the above embodiments.
An embodiment of the present specification provides a computer program product comprising instructions which, when executed by a processor of a computer device, enable the computer device to perform the steps of the method of any one of the embodiments described above.
It should be noted that the logic and/or steps represented in the flowcharts or otherwise described herein, for example, may be considered as an ordered listing of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.

Claims (20)

1. A method of training a pedestrian detection model, the method comprising:
obtaining a visible light sample image and a thermal infrared sample image aiming at a target pedestrian scene;
performing feature reconstruction based on multi-mode fusion image features between the visible light sample image and the thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features; the multi-mode fusion image features are obtained by performing element-by-element addition operation based on the features of the visible light sample image and the features of the thermal infrared sample image;
determining a first single-mode similarity loss between a visible light extraction feature and the visible light reconstruction feature of the visible light sample image, a second single-mode similarity loss between a thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature;
and according to the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss, performing single-mode supervision training and cross-mode supervision training on the initial pedestrian detection model at the same time until the model stopping training condition is met, and obtaining a target pedestrian detection model.
2. The method of claim 1, wherein the initial pedestrian detection model comprises first and second encoding networks connected in parallel, the first and second encoding networks being commonly connected to a fusion component having connected thereto first and second decoding networks connected in parallel; the feature reconstruction is performed based on the multi-mode fusion image features between the visible light sample image and the thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features, including:
inputting the visible light sample image into the first coding network for feature extraction to obtain visible light mode features;
inputting the thermal infrared sample image into the second coding network for feature extraction to obtain thermal infrared mode features;
performing element-by-element addition operation on the visible light mode characteristics and the thermal infrared mode characteristics through the fusion component to obtain the multi-mode fusion image characteristics;
and inputting the multi-mode fusion image features into the first decoding network and the second decoding network to respectively perform feature reconstruction, and correspondingly obtaining the visible light reconstruction features and the thermal infrared reconstruction features.
3. The method of claim 1, wherein the determining the multi-modality fusion image feature comprises:
acquiring visible light mode characteristics of the visible light sample image and thermal infrared mode characteristics of the thermal infrared sample image;
and performing element-by-element addition operation on the thermal infrared mode characteristics and the visible light mode characteristics to obtain the multi-mode fusion image characteristics.
4. The method according to claim 1, wherein the method further comprises:
acquiring an initial visible light image and an initial thermal infrared image which are obtained by shooting the target pedestrian scene;
preprocessing the initial visible light image and the initial thermal infrared image according to the input data size of the initial pedestrian detection model to obtain the visible light sample image and the thermal infrared sample image; and taking the normalized visible light sample image and the normalized thermal infrared sample image as input data of the initial pedestrian detection model.
5. The method of claim 4, wherein the visible light sample image and the thermal infrared sample image are normalized by:
Normalizing the visible light sample image according to a first channel mean value and a first channel standard deviation corresponding to the visible light sample image to obtain input data corresponding to the visible light sample image;
and carrying out normalization processing on the thermal infrared sample image according to a second channel mean value and a second channel standard deviation corresponding to the thermal infrared sample image to obtain input data corresponding to the thermal infrared sample image.
6. The method according to any one of claims 1 to 5, wherein model loss data is calculated using the following loss function:
L = L1 + L2 + L3 + L4
L1 = MSE(RGB_i, RGB′_i)
L2 = MSE(TH_i, TH′_i)
L3 = MSE(RGB_i, TH′_i)
L4 = MSE(TH_i, RGB′_i)
wherein L is the model loss data, L1 is the first single-mode similarity loss, L2 is the second single-mode similarity loss, L3 is the first multi-mode interaction loss, L4 is the second multi-mode interaction loss, RGB_i is the visible light extraction feature, RGB′_i is the visible light reconstruction feature, TH_i is the thermal infrared extraction feature, TH′_i is the thermal infrared reconstruction feature, i denotes the feature index, and MSE denotes the mean square error loss.
7. A pedestrian detection method, the method comprising:
obtaining a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot aiming at any pedestrian scene;
Inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into a target pedestrian detection model obtained by the method of any one of claims 1 to 6 for pedestrian detection, so as to obtain a pedestrian detection result.
8. A pedestrian detection method, the method comprising:
obtaining a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot aiming at any pedestrian scene;
inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into a target pedestrian detection model for pedestrian detection to obtain a pedestrian detection result; the loss data of the target pedestrian detection model in the training process comprises single-mode loss and cross-mode loss; the single-mode loss and the cross-mode loss are used together to supervise training of the target pedestrian detection model's capability to extract multi-mode fusion image characteristics between the visible light to-be-detected image and the thermal infrared to-be-detected image;
the single-mode loss comprises a first single-mode similar loss between a visible light extraction feature and a visible light reconstruction feature of the visible light sample image and a second single-mode similar loss between a thermal infrared extraction feature and a thermal infrared reconstruction feature of the thermal infrared sample image;
The cross-modal loss includes a first multi-modal interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, a second multi-modal interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
9. The method of claim 8, wherein the target pedestrian detection model comprises first and second encoding networks in parallel, the first and second encoding networks being commonly connected to a fusion component; and wherein inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model for pedestrian detection to obtain a pedestrian detection result comprises the following steps:
inputting the visible light to-be-detected image into the first coding network for feature extraction to obtain visible light to-be-detected features;
inputting the thermal infrared image to be detected into the second coding network for feature extraction to obtain thermal infrared features to be detected;
performing element-by-element addition operation on the visible light feature to be detected and the thermal infrared feature to be detected through the fusion component to obtain a fusion image feature to be detected;
and performing target detection based on the fusion image features to be detected to obtain the pedestrian detection result.
10. The method according to claim 9, wherein the step of performing pedestrian detection based on the feature of the fusion image to be detected to obtain the pedestrian detection result includes:
and carrying out convolution, pooling and activation processing operations according to the characteristics of the fusion image to be detected so as to carry out target detection and obtain the pedestrian detection result.
11. A training device for a pedestrian detection model, the device comprising:
the sample image acquisition module is used for acquiring a visible light sample image and a thermal infrared sample image aiming at a target pedestrian scene;
the image feature reconstruction module is used for carrying out feature reconstruction based on the multi-mode fusion image features between the visible light sample image and the thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features; the multi-mode fusion image features are obtained by performing element-by-element addition operation based on the features of the visible light sample image and the features of the thermal infrared sample image;
the loss data determining module is used for determining a first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image, a second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature;
And the detection model determining module is used for simultaneously carrying out single-mode supervision training and cross-mode supervision training on the initial pedestrian detection model according to the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss until the model stopping training condition is met, so as to obtain a target pedestrian detection model.
12. The apparatus of claim 11, wherein the initial pedestrian detection model comprises first and second encoding networks connected in parallel, the first and second encoding networks being commonly connected to a fusion component having connected thereto first and second decoding networks connected in parallel; the image feature reconstruction module is further configured to input the visible light sample image into the first coding network for feature extraction, so as to obtain visible light mode features; inputting the thermal infrared sample image into the second coding network for feature extraction to obtain thermal infrared mode features; performing element-by-element addition operation on the visible light mode characteristics and the thermal infrared mode characteristics through the fusion component to obtain the multi-mode fusion image characteristics; and inputting the multi-mode fusion image features into the first decoding network and the second decoding network to respectively perform feature reconstruction, and correspondingly obtaining the visible light reconstruction features and the thermal infrared reconstruction features.
13. The apparatus of claim 11, further comprising a fusion feature determination module for acquiring a visible light mode feature of the visible light sample image and a thermal infrared mode feature of the thermal infrared sample image; and performing element-by-element addition operation on the thermal infrared mode characteristics and the visible light mode characteristics to obtain the multi-mode fusion image characteristics.
14. The apparatus according to any one of claims 11 to 13, wherein model loss data is calculated using the following loss function:
L = L1 + L2 + L3 + L4
L1 = MSE(RGB_i, RGB′_i)
L2 = MSE(TH_i, TH′_i)
L3 = MSE(RGB_i, TH′_i)
L4 = MSE(TH_i, RGB′_i)
wherein L is the model loss data, L1 is the first single-mode similarity loss, L2 is the second single-mode similarity loss, L3 is the first multi-mode interaction loss, L4 is the second multi-mode interaction loss, RGB_i is the visible light extraction feature, RGB′_i is the visible light reconstruction feature, TH_i is the thermal infrared extraction feature, TH′_i is the thermal infrared reconstruction feature, i denotes the feature index, and MSE denotes the mean square error loss.
15. A pedestrian detection apparatus, characterized in that the apparatus comprises:
the to-be-detected image acquisition module is used for acquiring a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot for any pedestrian scene;
The detection result determining module is configured to input the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model obtained by the method according to any one of claims 1 to 6 to perform pedestrian detection, so as to obtain a pedestrian detection result.
16. A pedestrian detection apparatus, characterized in that the apparatus comprises:
the to-be-detected image acquisition module is used for acquiring a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot for any pedestrian scene;
the detection result determining module is used for inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into a target pedestrian detection model to detect pedestrians and obtain pedestrian detection results; the loss data of the target pedestrian detection model in the training process comprises single-mode loss and cross-mode loss; the single-mode loss and the cross-mode loss are used together to supervise training of the target pedestrian detection model's capability to extract multi-mode fusion image characteristics between the visible light to-be-detected image and the thermal infrared to-be-detected image;
the single-mode loss comprises a first single-mode similar loss between a visible light extraction feature and a visible light reconstruction feature of the visible light sample image and a second single-mode similar loss between a thermal infrared extraction feature and a thermal infrared reconstruction feature of the thermal infrared sample image;
The cross-modal loss includes a first multi-modal interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, a second multi-modal interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
17. The apparatus of claim 16, wherein the target pedestrian detection model comprises first and second encoding networks in parallel, the first and second encoding networks being commonly connected to a fusion component;
the detection result determining module is further configured to input the visible light to-be-detected image into the first coding network for feature extraction, so as to obtain visible light to-be-detected features; inputting the thermal infrared image to be detected into the second coding network for feature extraction to obtain thermal infrared features to be detected; performing element-by-element addition operation on the visible light feature to be detected and the thermal infrared feature to be detected through the fusion component to obtain a fusion image feature to be detected; and performing target detection based on the fusion image features to be detected to obtain the pedestrian detection result.
18. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 10 when the computer program is executed.
19. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 10.
20. A chip comprising a memory unit and a processing unit, the memory unit storing a computer program, characterized in that the processing unit implements the steps of the method of any of claims 1 to 10 when the computer program is executed.
CN202311062534.2A 2023-08-22 2023-08-22 Training of pedestrian detection model, pedestrian detection method, device, equipment and medium Pending CN117036890A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311062534.2A CN117036890A (en) 2023-08-22 2023-08-22 Training of pedestrian detection model, pedestrian detection method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311062534.2A CN117036890A (en) 2023-08-22 2023-08-22 Training of pedestrian detection model, pedestrian detection method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117036890A true CN117036890A (en) 2023-11-10

Family

ID=88642937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311062534.2A Pending CN117036890A (en) 2023-08-22 2023-08-22 Training of pedestrian detection model, pedestrian detection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117036890A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022104618A1 (en) * 2020-11-19 2022-05-27 Intel Corporation Bidirectional compact deep fusion networks for multimodality visual analysis applications
WO2022127112A1 (en) * 2020-12-14 2022-06-23 奥比中光科技集团股份有限公司 Cross-modal face recognition method, apparatus and device, and storage medium
CN114612937A (en) * 2022-03-15 2022-06-10 西安电子科技大学 Single-mode enhancement-based infrared and visible light fusion pedestrian detection method
CN115457456A (en) * 2022-08-22 2022-12-09 武汉理工大学 Multispectral pedestrian detection method and system based on intelligent vehicle

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MINGYUE LI et al.: "Multimodal Interactive Supervised Pedestrian Detection Based on YOLOv5", 2023 3rd International Symposium on Computer Technology and Information Science, pages 61-64 *

Similar Documents

Publication Publication Date Title
Dasgupta et al. Spatio-contextual deep network-based multimodal pedestrian detection for autonomous driving
Luo et al. Thermal infrared image colorization for nighttime driving scenes with top-down guided attention
US11106903B1 (en) Object detection in image data
CN113673425B (en) Multi-view target detection method and system based on Transformer
KR20210031427A (en) Methods, devices, computer devices and media for recognizing traffic images
CN109657581A (en) Urban track traffic gate passing control method based on binocular camera behavioral value
Li et al. IVFuseNet: Fusion of infrared and visible light images for depth prediction
He et al. A feature fusion method to improve the driving obstacle detection under foggy weather
Singh Surround-view vision-based 3d detection for autonomous driving: A survey
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
CN110490171B (en) Dangerous posture recognition method and device, computer equipment and storage medium
Zhao et al. FSDF: A high-performance fire detection framework
Zhou et al. A pedestrian extraction algorithm based on single infrared image
Wei et al. Infrared pedestrian detection using improved UNet and YOLO through sharing visible light domain information
CN117789153B (en) Automobile oil tank outer cover positioning system and method based on computer vision
CN117475355A (en) Security early warning method and device based on monitoring video, equipment and storage medium
Yang et al. A review on infrared and visible image fusion algorithms based on neural networks
Zhang et al. A quality index metric and method for online self-assessment of autonomous vehicles sensory perception
CN116681687B (en) Wire detection method and device based on computer vision and computer equipment
CN113489958A (en) Dynamic gesture recognition method and system based on video coding data multi-feature fusion
CN117115630A (en) Multispectral vehicle re-identification method under strong light based on cyclical consistency
Maheswari et al. Thermal infrared image semantic segmentation for night-time driving scenes based on deep learning
CN116051872A (en) Feature point matching method of cross-spectrum image
CN117036890A (en) Training of pedestrian detection model, pedestrian detection method, device, equipment and medium
Chen et al. Transformer fusion-based scale-aware attention network for multispectral victim detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination