CN113761995A

CN113761995A - A Cross-modal Pedestrian Re-identification Method Based on Double Transform Alignment and Blocking

Info

Publication number: CN113761995A
Application number: CN202010814790.2A
Authority: CN
Inventors: 陈洪刚; 刘强; 滕奇志; 何小海; 卿粼波; 吴晓红
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2020-08-13
Filing date: 2020-08-13
Publication date: 2021-12-07

Abstract

The invention provides a cross-mode pedestrian re-identification method based on double-transformation alignment and blocking. Firstly, extracting the features of the input infrared and visible light pedestrian images by using a basic branch network, linearly regressing a group of affine transformation parameters by using the high-level features of the images, and then generating an aligned image by using the parameters, wherein the image can effectively relieve the modal difference of misalignment. And then, horizontally dividing the aligned image into three blocks, taking out the characteristics of the three block images, and fusing the three block images with the aligned global characteristics and the original image characteristics to obtain the total characteristics of the visible light and the infrared image. Next, the total features of the infrared and visible images are mapped to the same embedding space. And finally, performing joint training by combining the identity loss and the most difficult batch sampling loss function with the weight to improve the identification precision. The invention is mainly applied to the video monitoring intelligent analysis application system, and has wide application prospect in the fields of image retrieval, intelligent security and the like.

Description

A Cross-modal Pedestrian Re-identification Method Based on Double Transform Alignment and Blocking

技术领域technical field

本发明涉及一种基于双变换对齐与分块的跨模态行人重识别方法，以及一种新的网络模型DTASN(Dual transform alignment and segmentation network)，涉及视频智能监控领域中的跨模态行人重识别问题，属于计算机视觉与智能信息处理领域。The invention relates to a cross-modal pedestrian re-identification method based on double transform alignment and segmentation, and a new network model DTASN (Dual transform alignment and segmentation network), and relates to a cross-modal pedestrian re-identification method in the field of video intelligent monitoring. The recognition problem belongs to the field of computer vision and intelligent information processing.

背景技术Background technique

行人重识别(Person Re-Identification，ReID)是计算机视觉领域中的一种技术，行人重识别的目的是在多个不重叠的摄像机中检索感兴趣的人，通常被认为是图像检索的一个子问题。一个高效的ReID算法可以缓解视频观看的痛苦，加速调查的进程。行人重识别在视频监控，智能安防等领域开阔的应用前景，引起了学术界和工业界的广泛关注，使其成为计算机视觉领域一个既很有研究价值又极具挑战性的研究热点。Person Re-Identification (ReID) is a technique in the field of computer vision. The purpose of Person Re-Identification is to retrieve persons of interest from multiple non-overlapping cameras, which is usually considered as a sub-section of image retrieval. question. An efficient ReID algorithm can ease the pain of video viewing and speed up the investigation process. The broad application prospect of pedestrian re-identification in video surveillance, intelligent security and other fields has attracted extensive attention from academia and industry, making it a valuable and challenging research hotspot in the field of computer vision.

当前，大多数研究主要集中在RGB-RGB(单模态)行人重识别问题，其中probe和gallery行人都是可见摄像机捕获。但是，可见光相机可能无法捕获光照变化中的外观信息，尤其是在光照条件不足时(例如，在夜间或黑暗的环境中)。得益于技术的发展，当前大多数新一代相机都可以根据光线条件自动进行可见光和红外模式切换。因此，有必要开发一些方法来解决可见光和红外图像跨模态ReID问题。与传统的行人重识别不同，可见光和红外图像跨模态行人重识别是将不同光谱的可见光行人图像和红外摄像机捕捉到的行人图像进行匹配，这种可见光图像与红外图像跨模态行人重识别VI-ReID(Visible andinfrared person re-identification)主要解决跨模态图像的匹配。VI-ReID通常使用可见光(或红外)行人图像来搜索整个摄像头设备中的红外(或可见光)行人图像。Currently, most research focuses on the RGB-RGB (single-modal) pedestrian re-identification problem, where both probe and gallery pedestrians are captured by visible cameras. However, visible light cameras may fail to capture appearance information in lighting changes, especially when lighting conditions are insufficient (eg, at night or in dark environments). Thanks to advances in technology, most current-generation cameras can automatically switch between visible and infrared modes depending on lighting conditions. Therefore, it is necessary to develop some methods to solve the cross-modal ReID problem of visible and infrared images. Different from the traditional pedestrian re-identification, the cross-modal pedestrian re-identification of visible light and infrared images is to match the visible-light pedestrian images of different spectra and the pedestrian images captured by the infrared camera. VI-ReID (Visible and infrared person re-identification) mainly solves the matching of cross-modal images. VI-ReID typically uses visible light (or infrared) pedestrian images to search for infrared (or visible light) pedestrian images throughout the camera device.

行人图像(裁剪的行人)通常是通过自动检测器或跟踪器获得的。然而，由于人的检测/跟踪结果不完善，图像的不对齐通常是不可避免的，即存在部分遮挡、部件缺失(只有身体的一部分)、背景过多等语义错位错误。为了解决ReID中的语义错位问题。一些工作试图通过减少异类数据的跨模态差异来提高行人匹配的准确性。另外，还有一些方法，着重解决行人不对齐的问题，以提高行人匹配的准确性，从而在某种程度上减小模态差异。除了上述困难之外，由于姿态和视角的变化，行人的外观也会发生很大的变化。许多实际问题都会导致图像之间的空间语义错位，即同一空间位置对应的两幅匹配图像的内容语义不同，从而限制了人员再识别技术的鲁棒性和有效性。因此，开发一个具有较强判别能力的模型来同时处理跨模态变化是很重要的，它不仅可以减少异构数据的跨模态差异，同时还可以缓解模态内图像之间不对齐带来的图像差异，从而提高跨模态行人重识别的精度。Pedestrian images (cropped pedestrians) are usually obtained by automatic detectors or trackers. However, due to imperfect human detection/tracking results, misalignment of images is usually unavoidable, i.e., there are semantic misalignment errors such as partial occlusion, missing parts (only part of the body), and excessive background. In order to solve the problem of semantic dislocation in ReID. Some works attempt to improve the accuracy of pedestrian matching by reducing cross-modal differences in heterogeneous data. In addition, there are some methods that focus on solving the problem of pedestrian misalignment to improve the accuracy of pedestrian matching, thereby reducing the modal difference to some extent. In addition to the above difficulties, the appearance of pedestrians also changes greatly due to changes in pose and perspective. Many practical problems can lead to spatial semantic dislocation between images, that is, the content semantics of two matching images corresponding to the same spatial location are different, which limits the robustness and effectiveness of person re-identification technology. Therefore, it is important to develop a model with strong discriminative ability to deal with cross-modal changes simultaneously, which can not only reduce the cross-modal differences in heterogeneous data, but also alleviate the misalignment between images within the modalities. image differences, thereby improving the accuracy of cross-modal person re-identification.

发明内容SUMMARY OF THE INVENTION

本发明提出了一种基于双变换对齐与分块的跨模态行人重识别方法，设计了一种多路径双变换对齐与切分网络结构DTASN，每个训练批次采样策略是：从训练数据集中随机选取P个行人，然后每个行人再随机选取K张可见光行人图像和K张红外行人图像，构成一个包含有2PK张行人图像的批次训练数据，最后将2PK张行人图像送入网络进行训练。在标签信息的监督下，利用卷积神经网络的自我学习能力，分别对错位严重的可见光图像和红外图像进行自适应的对齐矫正，并且将对齐矫正后的图像进行水平切分得到局部块的图像，从而达到提高跨模态行人重识别精度的目的。The invention proposes a cross-modal pedestrian re-identification method based on double-transform alignment and segmentation, and designs a multi-path double-transform alignment and segmentation network structure DTASN. The sampling strategy of each training batch is: from the training data Centrally select P pedestrians at random, and then randomly select K visible light pedestrian images and K infrared pedestrian images for each pedestrian to form a batch training data containing 2PK pedestrian images, and finally send 2PK pedestrian images into the network. train. Under the supervision of the label information, the self-learning ability of the convolutional neural network is used to perform adaptive alignment correction on the severely misplaced visible light images and infrared images, and the aligned and corrected images are horizontally segmented to obtain local block images. , so as to achieve the purpose of improving the accuracy of cross-modal pedestrian re-identification.

一种基于双变换对齐与分块的跨模态行人重识别方法，包括以下步骤：A cross-modal pedestrian re-identification method based on double transform alignment and block, comprising the following steps:

(1)利用可见光基分支网络提取可见光行人图像

的特征，得到

利用红外基分支网络去提取红外行人图像

的特征，得到

(1) Using visible light-based branch network to extract visible light pedestrian images

features, get

Extracting Infrared Pedestrian Images Using Infrared-Based Branch Networks

features, get

(2)从可见光基分支网络中取出第五残差块(conv_5x)特征输入到可见光图像空间变换模块的网格网络中，线性回归一组仿射变换参数

并生成可见光图像变换网格，然后通过双线性采样器生成新的可见光行人图像

然后对

进行特征提取，得到变换后的可见光行人全局特征

(2) Take the fifth residual block (conv_5x) feature from the visible light base branch network and input it into the grid network of the visible light image space transformation module, and linearly regress a set of affine transformation parameters

and generate a visible light image transformation grid, and then generate a new visible light pedestrian image through a bilinear sampler

then right

Perform feature extraction to obtain the transformed visible light pedestrian global feature

(3)从红外基分支网络中取出第五残差块(conv_5x)特征输入到红外图像空间变换模块的网格网络中，线性回归一组仿射变换参数

并生成红外图像变换网格，然后通过双线性采样器生成新的红外行人对齐图像

然后对

进行特征提取，得到全局特征

(3) Take the fifth residual block (conv_5x) feature from the infrared base branch network and input it into the grid network of the infrared image space transformation module, and linearly regress a set of affine transformation parameters

and generate an infrared image transformation grid, and then generate a new infrared pedestrian alignment image through a bilinear sampler

then right

Perform feature extraction to get global features

(4)将新的可见光行人图像

水平切分为上、中、下三个不重叠块；然后分别提取这三块的特征，得到特征

和

最后将对齐图像全局特征

和三块图像特征求和得到可见光变换对齐与切分网络的总特征

(4) The new visible light pedestrian image

The horizontal segmentation is divided into three non-overlapping blocks, upper, middle and lower; then the features of these three blocks are extracted respectively to obtain the features

and

Finally, the image global features will be aligned

Sum the three image features to get the total features of the visible light transformation alignment and segmentation network

(5)将新的红外行人图像

和

最后将对齐图像全局特征

和三块图像特征求和得到红外变换对齐与切分网络的总特征

(5) Put the new infrared pedestrian image

and

Finally, the image global features will be aligned

Sum the three image features to get the total features of the infrared transform alignment and segmentation network

(6)将

与可见光基础分支网络提取到的特征

进行加权相加融合，得到可见光分支的总特征

将

与红外基础分支网络提取到的特征

进行加权相加融合，得到红外分支的总特征

然后将可见光图像的特征

和红外图像的特征

映射到同一个特征嵌入空间中，结合身份损失函数和带权重的最难批次采样损失函数进行训练，最终提高跨模态行人重识别精度。(6) will

Features extracted from the visible light base branch network

Perform weighted addition and fusion to obtain the total features of the visible light branch

Will

Features extracted with the infrared base branch network

Perform weighted addition fusion to get the total features of the infrared branch

Then the features of the visible light image

and infrared image features

Map to the same feature embedding space, combine the identity loss function and the weighted most difficult batch sampling loss function for training, and finally improve the accuracy of cross-modal person re-identification.

附图说明Description of drawings

图1为本发明基于双变换对齐与分块的跨模态行人重识别方法框图；1 is a block diagram of a cross-modal pedestrian re-identification method based on double-transform alignment and segmentation of the present invention;

图2为本发明可见光变换对齐与分块支路结构图。FIG. 2 is a structural diagram of the visible light transformation alignment and block branch according to the present invention.

图3为本发明红外变换对齐与分块支路结构图。FIG. 3 is a structural diagram of the infrared transform alignment and block branch according to the present invention.

具体实施方式Detailed ways

下面结合附图1、附图2和附图3对本发明作进一步说明：Below in conjunction with accompanying drawing 1, accompanying drawing 2 and accompanying drawing 3, the present invention is further described:

DTASN模型网络结构和原理具体如下：The network structure and principle of the DTASN model are as follows:

该网络模型框架以端到端的方式通过多径双对齐与分块网络来学习特征表示和距离度量，同时保持较高的可分辨性。该框架包括三个组件：(1)特征提取模块，(2)特征嵌入模块，(3)损失计算模块。所有路径的骨干网结都是采用的深度残差网络ResNet50。由于可用数据的缺乏，为了加快训练过程的收敛速度，本发明使用预先训练好的ResNet50模型对网络进行初始化。为了加强对局部特征的注意，本发明在每条路径上都应用了位置注意模块。This network model framework learns feature representations and distance metrics through a multi-path dual alignment and block network in an end-to-end manner while maintaining high discriminability. The framework consists of three components: (1) feature extraction module, (2) feature embedding module, (3) loss calculation module. The backbone network of all paths is the deep residual network ResNet50. Due to the lack of available data, in order to speed up the convergence speed of the training process, the present invention uses the pre-trained ResNet50 model to initialize the network. To strengthen the attention to local features, the present invention applies a location attention module on each path.

对于可见光和红外交叉模态行人重识别，相似点在于行人轮廓和纹理的非彩色信息，显著差异在于成像光谱。因此，本发明设计了一个孪生网络模型来提取红外和可见行人图像的视觉特征。如附图1所示，本发明使用两个结构相同的网络提取可见光和红外图像的特征表示，注意他们之间权值不共享。特征提取模块主要包含了处理可见光和红外数据的两个主要网络：基分支网络和对齐与分割网络。For visible-light and infrared cross-modal pedestrian re-identification, the similarity lies in the achromatic information of pedestrian contours and textures, and the significant difference lies in the imaging spectrum. Therefore, the present invention designs a Siamese network model to extract visual features of infrared and visible pedestrian images. As shown in FIG. 1 , the present invention uses two networks with the same structure to extract the feature representation of visible light and infrared images, and it is noted that the weights do not share between them. The feature extraction module mainly contains two main networks for processing visible light and infrared data: the base branch network and the alignment and segmentation network.

(1)基分支网络：(1) Base branch network:

由两个相同的子网络组成，他们权重不共享，网络的骨干结构采用的是ResNet50。输入图像都为三通道图像，其高度和宽度为：288×144。假设可见光和红外基分支网络的输入图像分别用

和

表示，基分支网络特征提取器用φ(·)表示。那么

可表示为利用可见光基础分支网络提取到的可见光图像

的深度特征，

可表示为利用红外基础分支网络提取的红外图像

的深度特征；所有输出特征向量的长度为2048。It consists of two identical sub-networks, their weights are not shared, and the backbone structure of the network is ResNet50. The input images are all three-channel images whose height and width are: 288×144. Assume that the input images of the visible-light and infrared-based branch networks are respectively

and

, and the base branch network feature extractor is denoted by φ(·). So

It can be expressed as a visible light image extracted by using the visible light basic branch network

The depth features of ,

can be represented as an infrared image extracted using the infrared base branch network

The depth features of ; all output feature vectors are of length 2048.

(2)空间变换模块(2) Spatial transformation module

可见光和红外变换对齐原理：利用可见光和红外基分支中第五残差块特征conv_5x线性回归出一组仿射变换参数

和

然后，通过式(1)建立仿射变换前后图像对应的坐标关系：Visible light and infrared transformation alignment principle: use the fifth residual block feature conv_5x in the visible and infrared base branches to linearly regress a set of affine transformation parameters

and

Then, the coordinate relationship corresponding to the images before and after the affine transformation is established by formula (1):

其中，

是目标图像的规则网格中的第i个目标坐标，

是输入图像中采样点的源坐标，

和

是仿射变换矩阵，其中θ₁₃和θ₂₃控制转换图像的偏移，θ₁₁，θ₁₂，θ₂₁和θ₂₂控制转换图像的大小和旋转变化；仿射变换时使用双线性采样对图像网格进行采样；

和

为双线性采样器的输入图像，假定通过空间变换输出的可见光和红外新图像为

和

则他们之间的对应关系为:in,

is the ith target coordinate in the regular grid of the target image,

are the source coordinates of the sample points in the input image,

and

is an affine transformation matrix, where θ ₁₃ and θ ₂₃ control the offset of the transformed image, and θ ₁₁ , θ ₁₂ , θ ₂₁ and θ ₂₂ control the size and rotation changes of the transformed image; bilinear sampling is used for affine transformation on the image Grid for sampling;

and

is the input image of the bilinear sampler, assuming that the new visible and infrared images output by spatial transformation are

and

Then the corresponding relationship between them is:

其中,

和

表示目标图像中每个通道中坐标(m,n)位置的像素值，

和

表示源图像中每个通道中(n,m)坐标处的像素值，H和W表示目标图像(或源图像)的高度和宽度；双线性采样是连续可导的，因此上述方程式是连续可导并允许梯度反向传播，从而实现行人自适应对齐。对齐后图像的全局特征用可用

和

表示。此外，为了学习更加具有鉴别力的特征，本发明对变换后的图像水平分为三个不重叠的固定块。in,

and

represents the pixel value of the coordinate (m, n) position in each channel in the target image,

and

represents the pixel value at the (n,m) coordinate in each channel in the source image, and H and W represent the height and width of the target image (or source image); bilinear sampling is continuously differentiable, so the above equation is continuous It can induce and allow gradient back-propagation to achieve pedestrian adaptive alignment. The global features of the aligned images are available with

and

express. In addition, in order to learn more discriminative features, the present invention horizontally divides the transformed image into three non-overlapping fixed blocks.

(3)可见光变换对齐与分块支路(3) Visible light transformation alignment and block branch

如附图2所示，首先将变换对齐的可见光图像进行水平切分为上、中、下三个不重叠的块；第一块高度范围像素为1×96，第二块高度范围像素为97×192，第三块高度范围像素为193×288，三块宽度像素均为144；然后，分别将这三个区域块图像复制到重新定义的3张新的高宽像素为288×144，像素值全为0的子图的对应位置；接下来，通过4个残差网络分别提取变换后的全局特征和3个块子图特征；提取到的特征分别为

和

本发明选择将全局特征和3个分块的新图特征直接求，得到变换后图像的总特征

As shown in Figure 2, firstly, the transformed and aligned visible light image is horizontally divided into three non-overlapping blocks, upper, middle and lower; ×192, the height range of the third block is 193 × 288, and the width of the three blocks is 144; then, copy these three area block images to 3 new redefined height and width pixels of 288 × 144, pixel The corresponding positions of the subgraphs whose values are all 0; next, the transformed global features and the three block subgraph features are extracted through four residual networks; the extracted features are

and

The present invention chooses to directly obtain the global feature and the new image features of the three blocks to obtain the total feature of the transformed image.

最后再将

与可见光基分支网络的特征

通过加权相加的方式融合得到可见光图像的最终特征

即

其中λ是0到1区间的预定义权衡参数。Finally, the

Features of Visible Light-Based Branch Networks

The final features of the visible light image are obtained by fusion by weighted addition

which is

where λ is a predefined trade-off parameter in the interval 0 to 1.

(4)红外变换对齐与分块支路(4) Infrared transform alignment and block branch

如附图3所示，首先将变换对齐红外图像进行水平切分为上、中、下三个不重叠的块；第一块高度范围像素为1×96，第二块高度范围像素为97×192，第三块高度范围像素为193×288，三块宽度像素均为144；然后，分别将这三个区域块图像复制到重新定义的3张新的高宽像素为288×144，像素值全为0的子图的对应位置；接下来，通过4个残差网络分别提取变换后的全局特征和3个块子图特征；提取到的特征分别为

和

本发明选择将全局特征和3个块子图特征直接求和，得到变换后图像的总特征

As shown in Figure 3, firstly, the transformed and aligned infrared image is horizontally divided into three non-overlapping blocks; the height range of the first block is 1×96, and the height range of the second block is 97× 192, the height range of the third block is 193×288 pixels, and the width pixels of the three blocks are 144; then, copy the three area block images to the redefined 3 new height and width pixels of 288×144, the pixel value The corresponding positions of the subgraphs with all 0s; next, the transformed global features and the three block subgraph features are extracted through the 4 residual networks respectively; the extracted features are

and

In the present invention, the global feature and the three block sub-image features are directly summed to obtain the total feature of the transformed image.

最后再将

与红外基分支网络的特征

通过加权相加的方式融合得到可见光图像的最终特征

即

其中λ是0到1区间的预定义权衡参数，以平衡两个特征的贡献。Finally, the

Characteristics of branched networks with infrared bases

which is

where λ is a predefined trade-off parameter in the interval 0 to 1 to balance the contributions of the two features.

(5)特征嵌入与损失计算(5) Feature embedding and loss calculation

为减少红外图像和可见光图像之间的交叉模态差异，通过同一个嵌套函数f_θ，f_θ本质上为一个全连接层(假设其参数为θ)，将可见光图像特征

和红外图像特征

和映射到同一特征空间，以获得嵌套特征

和

简写为

和

和

分别表示输出长度为512的一维特征向量；为了简化表达，使用

来表示一个可见光图像批次

中的第i个人的第j张图像，同理对于一个批次的红外图像

也是同样的表示。In order to reduce the cross-modal difference between the infrared image and the visible light image, through the same nested function f _θ , f _θ is essentially a fully connected layer (assuming its parameter is θ), and the visible light image features are combined.

and infrared image features

and map to the same feature space to get nested features

and

abbreviated as

and

Respectively represent a one-dimensional feature vector with an output length of 512; in order to simplify the expression, use

to represent a batch of visible light images

The jth image of the ith person in , and similarly for a batch of infrared images

It is also the same expression.

身份损失函数：Identity loss function:

假设

和

然后

和

则分别代表输入行人

和

的身份预测概率；例如，

表示预测输入可见光图像

的身份为k的概率；使用

和

表示真实身份为i的输入图像

的标注信息，也即

和

那么一个批次中使用交叉熵损失预测身份的身份损失函数定义为：Assumption

and

Then

and

respectively represent the input pedestrians

and

The identity prediction probability of ; for example,

represents the prediction input visible light image

is the probability that the identity is k; use

and

represents the input image with real identity i

label information, that is,

and

Then the identity loss function for predicting identity using cross-entropy loss in a batch is defined as:

带权重的最难批次采样损失函数：Hardest batch sampling loss function with weights:

由于L_id只考虑每个输入样本的身份，并未强调输入的可见光和红外是否属于同一身份；为了进一步缓解红外图像与可见光图像之间的跨模态差异，由于TriHard loss(最难三元组采样损失)只考虑了极端样本的信息，造成局部梯度特别大，使得网络崩溃，与TriHard loss不同，本发明使用单批次自适应加权的最难三元组采样损失函数，其核心思想是，对于一个批次中的每个红外图像样本

可以在该批次中的可见光图像中，分别计算出ID身份与

相同的正样本

的距离，对于正样本对，在嵌套特征空间中欧式距离越大，权值分配越大；同理，对于

也可以在该批次的所有可见光图像中，分别计算出ID身份与

不同的负样本

的距离，对于负样本对，在嵌套特征空间中欧式距离越大，权值分配越小；因此可知，不同距离(即，困难程度不同)分配不同的权重；从而带权重的最难三元组采样损失函数继承了正负样本对之间相对距离优化的优点，避免了引入任何多余参数，使其更加灵活和适应性强；因此，对于每一个批次中每个可见光图像锚点样本

带权重的最难三元组采样损失函数

计算为Since L _id only considers the identity of each input sample, it does not emphasize whether the input visible light and infrared belong to the same identity; in order to further alleviate the cross-modal difference between the infrared image and the visible light image, due to TriHard loss (the most difficult triplet Sampling loss) only considers the information of extreme samples, resulting in a particularly large local gradient, which makes the network collapse. Different from TriHard loss, the present invention uses the most difficult triple sampling loss function with single batch adaptive weighting. The core idea is, For each infrared image sample in a batch

In the visible light images in this batch, the ID identities and

the same positive sample

The distance of , for positive sample pairs, the larger the Euclidean distance in the nested feature space, the larger the weight distribution; similarly, for

It is also possible to calculate the ID identities and

different negative samples

For the negative sample pair, the larger the Euclidean distance in the nested feature space, the smaller the weight distribution; therefore, it can be seen that different distances (ie, different degrees of difficulty) are assigned different weights; thus the most difficult ternary with weights The group sampling loss function inherits the advantages of relative distance optimization between positive and negative sample pairs, avoids introducing any redundant parameters, making it more flexible and adaptable; therefore, for each visible light image anchor point sample in each batch

Hardest triplet sampling loss function with weights

Calculated as

其中p为对应的正样本集，n为负集，W_i ^p为正样本距离权值，W_i ⁿ表示负样本距离权值；同理，对于每一个批次中每个红外图像锚点样本

带权重的最难三元组采样损失函数

计算为：where p is the corresponding positive sample set, n is the negative set, W _i ^p is the distance weight of the positive sample, and W _i ⁿ is the distance weight of the negative sample; similarly, for each infrared image anchor point sample in each batch

Hardest triplet sampling loss function with weights

Calculated as:

因此，整体的带权重的最难三元组采样损失函数为：Therefore, the overall weighted hardest triple sampling loss function is:

最终，总损失函数定义为：Finally, the total loss function is defined as:

L_wrt＝L_id+λL_{c_wrt} (14)L _wrt =L _id + _{λL c_wrt} (14)

其中λ为预定义参数，用于平衡ID身份损失L_id和带权重的最难三元组采样损失L_{c_wrt}的贡献。where λ is a predefined parameter to balance the contributions of the ID identity loss L _id and the weighted hardest triple sampling loss L _{c_wrt} .

本发明在RegDB和SYSU-MM01数据集进行了网络结构消融研究，其中Baseline表示基准网络，L_id表示识别损失，L_{c_wrt}表示带权重的最难三元组采样损失函数，RE是随机擦除，PA表示位置注意模块PAM，ST表示STN空间变换网络，HDB表示水平地分块。另外还和一些主流算法进行了比较，使用单一查询设置进行评估，并使用Rank-1,Rank-5,Rank-10和mAP(平均匹配精度)作为评价指标。实验结果如表1，表2，表3和表4所示，实验精度相比于基准网络和其他对比算法均有较大提高。The present invention conducts network structure ablation research on RegDB and SYSU-MM01 data sets, where Baseline represents the benchmark network, L _id represents the recognition loss, L _{c_wrt} represents the most difficult triple sampling loss function with weight, RE is random erasure, PA stands for Position Attention Module PAM, ST stands for STN Spatial Transform Network, and HDB stands for Horizontal Blocking. In addition, comparisons are made with some mainstream algorithms, using a single query setting for evaluation, and using Rank-1, Rank-5, Rank-10 and mAP (Mean Matching Accuracy) as evaluation metrics. The experimental results are shown in Table 1, Table 2, Table 3 and Table 4. Compared with the benchmark network and other comparison algorithms, the experimental accuracy is greatly improved.

表1网络结构RegDB数据上的消融研究Table 1 Ablation study on RegDB data with network structure

表2网络结构在YSU-MM01数据上的消融研究Table 2 Ablation study of network structure on YSU-MM01 data

表3在RegDB数据集上与主流算法结果对比Table 3 compares the results of the mainstream algorithms on the RegDB dataset

表4在SYSU-MM01数据集上与主流算法结果对比Table 4 compares the results with mainstream algorithms on the SYSU-MM01 dataset

Claims

1. A cross-mode pedestrian re-identification method based on double-transformation alignment and blocking is characterized by comprising the following steps:

(1) method for extracting visible light pedestrian image by using visible light-based branch network

Is characterized by obtaining

Infrared pedestrian image extraction method using infrared-based branch network

Is characterized by obtaining

(2) Taking out the characteristics of a fifth residual block (conv _5x) from the visible light base branch network, inputting the characteristics into a grid network of a visible light image space transformation module, and linearly regressing a group of affine transformation parameters

And generating a visible light image transformation grid, and then generating a new visible light pedestrian alignment image through a bilinear sampler

Then to

Carrying out feature extraction to obtain global features

(3) Taking out the characteristics of a fifth residual block (conv _5x) from the infrared base branch network, inputting the characteristics into a grid network of an infrared image space transformation module, and linearly regressing a group of affine transformation parameters

And generating an infrared image transformation grid, and then generating a new infrared pedestrian alignment image through a bilinear sampler

Then to

Carrying out feature extraction to obtain global features

(4) New visible light pedestrian image

Horizontally cutting into an upper non-overlapping block, a middle non-overlapping block and a lower non-overlapping block; then extracting the characteristics of the three blocks respectively to obtain the characteristics

And

finally, the global features of the image are aligned

Summing the three image characteristics to obtain the total characteristics of the visible light conversion alignment and segmentation network

(5) New infrared pedestrian image

And

finally, the global features of the image are aligned

Summing the three image characteristics to obtain the total characteristics of the infrared conversion alignment and segmentation network

(6) Will be provided with

Features extracted from visible light basic branch network

Performing weighted addition fusion to obtain the total characteristics of visible light branch

Will be provided with

Features extracted from infrared basic branch network

Carrying out weighted addition fusion to obtain the total characteristics of the infrared branches

Then the characteristics of the visible light image

And features of infrared images

And mapping the data to the same characteristic embedding space, and training by combining an identity loss function and a most difficult batch sampling loss function with weight, thereby finally improving the cross-modal pedestrian re-identification precision.

2. The method of claim 1, wherein the sampling strategy for each training batch in step (1) is: randomly selecting P pedestrians from a training data set, then randomly selecting K visible light pedestrian images and K infrared pedestrian images for each pedestrian to form batch training data containing 2PK pedestrian images, and finally sending the 2PK pedestrian images into a network for training;

representing a visible light image extracted using a visible light basic branch network

The depth characteristic of (a) is,

representing infrared images extracted using an infrared-based branched network

The depth characteristic of (a); all output feature vectors are 2048 in length.

3. The method according to claim 1, wherein the transformation alignment is performed in steps (2) and (3) by using the fifth residual block conv _5x extracted from the visible light basic branch (infrared basic branch) to linearly regress a set of affine transformation parameters

And

then, establishing a coordinate relation corresponding to the images before and after the affine transformation through a formula (1):

wherein,

is the ith target coordinate in the regular grid of the target image,

is the source coordinates of the sample points in the input image,

and

is an affine transformation matrix in which₁₃And theta₂₃Controlling the shift, theta, of the converted image₁₁，θ₁₂，θ₂₁And theta₂₂Controlling the size and rotation change of the converted image; sampling an image grid by using bilinear sampling during affine transformation;

and

for the input image of the bilinear sampler, the new image of visible light and infrared outputted by space transformation is assumed to be

And

the correspondence between them is:

wherein,

and

a pixel value representing a coordinate (m, n) position in each channel in the target image,

and

represents the pixel value at the (n, m) coordinate in each channel in the source image, and H and W represent the height and width of the target image (or source image); bilinear sampling is continuously derivable, so the above equation is continuously derivable and allows gradient back propagation, thereby enabling pedestrian adaptive alignment, aligning global features of the image with available global features

And

it is shown that, in addition, the present invention horizontally divides the transformed image into three non-overlapping fixed blocks in order to learn more discriminative features.

4. The method according to claim 1, wherein in step (4), the transformed aligned image is first horizontally sliced into an upper, a middle and a lower blocks, respectively; the pixels in the first height range are 1-96, the pixels in the second height range are 97-192, the pixels in the third height range are 193-288, and the pixels in the three width ranges are all 144; then, the three area block images are respectively copied to the corresponding positions of 3 newly redefined sub-images with height and width of 288 × 144 and pixel values of all 0; next, the transformed global is extracted through 4 Resnet50 residual networks, respectivelyA feature and 3 sub-map features; the obtained characteristics are respectively

And

the invention selects a mode of directly summing the global characteristic and the 3 block sub-image characteristics to obtain the total characteristic of the transformed image

Finally, will

And the characteristics of the original map in the step (1)

Obtaining the final characteristics of the visible light image by means of weighted addition fusion

Namely, it is

5. The method according to claim 1, wherein in step (5), the transformed aligned image is first horizontally sliced into an upper, a middle and a lower blocks, respectively; the pixels in the first height range are 1-96, the pixels in the second height range are 97-192, the pixels in the third height range are 193-288, and the pixels in the three width ranges are all 144; then, the three region block images are respectively copied to the image data3 newly defined new height and width pixels are 288 multiplied by 144, and the pixel values are all 0 at the corresponding positions of the subgraph; next, extracting the transformed global features and 3 sub-graph features through 4 Resnet50 residual error networks respectively; the obtained characteristics are respectively

And

Finally, will

And the characteristics of the original map in the step (1)

Namely, it is

6. The method according to claim 1, wherein in step (6) for reducing the cross-modal difference between the infrared image and the visible light image, the same nesting function f is used_θ，f_θEssentially a fully connected layer (assuming its parameters are theta), characterizing the visible image

And infrared image characteristics

And mapping to the same feature space to obtain nested features

And

is abbreviated as

And

and

respectively representing one-dimensional feature vectors with the output length of 512; for simplicity of presentation, use is made of

To represent a visible light image batch

The jth image of the ith person in (1), similarly for an infrared image of a batch

Are also denoted by the same; suppose that

And

then the

And

respectively represent the input pedestrian

And

the identity prediction probability of (a); for example,

representing predictive input visible light images

Is the probability of k; use of

And

input image representing true identity i

Of (2), i.e. of

And

due to L_idOnly the identity of each input sample is considered, and whether the input visible light and infrared belong to the same identity is not emphasized; in order to further relieve the cross-modal difference between the infrared image and the visible light image, the TriHardloss (the most difficult triple sampling loss) only considers the information of extreme samples, so that the local gradient is extremely large, and the network is broken down; unlike TriHardloss, the present invention uses the most difficult triple sampling loss function for single batch adaptive weighting; the core idea is that for each infrared image sample in a batch

ID identities and

same positive sample

For the positive sample pair, the larger the Euclidean distance in the nested feature space is, the larger the weight distribution is; in the same way, for

The ID and ID can also be calculated in all visible light images of the batch respectively

Different negative examples

For negative example pairs, in nested featuresThe bigger the European distance in the space is, the smaller the weight distribution is; it can therefore be seen that different distances (with different degrees of difficulty) are assigned different weights; therefore, the most difficult triple sampling loss function with the weight inherits the advantage of optimizing the relative distance between the positive sample pair and the negative sample pair, avoids introducing any redundant parameter, and enables the triple sampling loss function to be more flexible and strong in adaptability; thus, the anchor point samples are for each visible light image in each batch

Weighted least difficult triple sampling loss function

The calculation is as follows:

where p is the corresponding positive sample set, n is the negative sample set, W_i ^pIs a positive sample distance weight, W_i ⁿRepresenting the distance weight of the negative sample; similarly, for each infrared image anchor point sample in each batch

Weighted least difficult triple sampling loss function

The calculation is as follows:

thus, the overall most difficult triplet sampling loss function with weights is:

finally, the total loss function is defined as:

L_wrt＝L_id+λL_{c_wrt} (11)

where λ is a predefined parameter for balancing ID identity loss L_idAnd the most difficult triplet sampling loss with weight, L_{c_wrt}The contribution of (c).