CN113761995A - Cross-mode pedestrian re-identification method based on double-transformation alignment and blocking - Google Patents

Cross-mode pedestrian re-identification method based on double-transformation alignment and blocking

Info

Publication number
CN113761995A
CN113761995A (application CN202010814790.2A)
Authority
CN
China
Prior art keywords
image
visible light
infrared
pedestrian
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010814790.2A
Other languages
Chinese (zh)
Inventor
陈洪刚
刘强
滕奇志
何小海
卿粼波
吴晓红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202010814790.2A priority Critical patent/CN113761995A/en
Publication of CN113761995A publication Critical patent/CN113761995A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Traffic Control Systems (AREA)
  • Image Processing (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

The invention provides a cross-modal pedestrian re-identification method based on double-transformation alignment and blocking. First, a base branch network extracts features from the input infrared and visible-light pedestrian images, a set of affine transformation parameters is linearly regressed from the high-level image features, and an aligned image is generated from these parameters; the aligned image effectively alleviates the modal difference caused by misalignment. The aligned image is then horizontally divided into three blocks, the features of the three block images are extracted and fused with the aligned global feature and the original image feature to obtain the total features of the visible-light and infrared images. Next, the total features of the infrared and visible-light images are mapped into the same embedding space. Finally, joint training with the identity loss and the weighted hardest-batch sampling loss function improves recognition accuracy. The invention is mainly applied to intelligent video-surveillance analysis systems and has broad application prospects in image retrieval, intelligent security and related fields.

Description

Cross-mode pedestrian re-identification method based on double-transformation alignment and blocking
Technical Field
The invention relates to a cross-modal pedestrian re-identification method based on double-transformation alignment and blocking and proposes a new network model, DTASN (Dual Transformation Alignment and Segmentation Network). It addresses the cross-modal pedestrian re-identification problem in intelligent video surveillance and belongs to the fields of computer vision and intelligent information processing.
Background
Pedestrian Re-Identification (ReID) is a computer-vision technique that aims to retrieve a person of interest across multiple non-overlapping cameras and is generally regarded as a sub-problem of image retrieval. An efficient ReID algorithm relieves the burden of manually reviewing video and accelerates investigations. Pedestrian re-identification has broad application prospects in video surveillance, intelligent security and related fields, and has attracted extensive attention in both academia and industry, making it a research hotspot of high value and high difficulty in computer vision.
Currently, most research focuses on the RGB-RGB (single-modality) pedestrian re-identification problem, in which both the probe and gallery pedestrians are captured by visible-light cameras. However, visible-light cameras may fail to capture appearance information under changing illumination, especially when lighting is insufficient (e.g., at night or in dark environments). Thanks to technological progress, most new-generation cameras can automatically switch between visible-light and infrared modes according to the lighting conditions. It is therefore necessary to develop methods that solve the visible-infrared cross-modality ReID problem. Unlike conventional pedestrian re-identification, visible-infrared cross-modal pedestrian re-identification, VI-ReID (Visible-Infrared person Re-IDentification), matches visible-light pedestrian images with pedestrian images captured under a different spectrum by an infrared camera, so its core difficulty is cross-modal image matching. VI-ReID typically uses a visible-light (or infrared) pedestrian image to search for the corresponding infrared (or visible-light) pedestrian image across all cameras.
Pedestrian images (cropped pedestrians) are typically obtained by an automatic detector or tracker. However, because detection and tracking results are imperfect, image misalignment is usually unavoidable: there are semantic displacement errors such as partial occlusion, missing body parts (only part of the body visible) and excessive background. To address semantic displacement in ReID, some works attempt to improve matching accuracy by reducing the cross-modal differences of heterogeneous data; others focus on the pedestrian misalignment problem to improve matching accuracy, thereby reducing modal differences to some extent. Besides these difficulties, pedestrian appearance also varies greatly with pose and viewing angle. Many practical factors cause spatial semantic misalignment between images, i.e., the content semantics at the same spatial position of two matched images differ, which limits the robustness and effectiveness of person re-identification. It is therefore important to develop a highly discriminative model that handles cross-modal variation: it should not only reduce the cross-modal differences of heterogeneous data, but also alleviate the intra-modality differences caused by misalignment between images, thereby improving the accuracy of cross-modal pedestrian re-identification.
Disclosure of Invention
The invention provides a cross-modal pedestrian re-identification method based on double-transformation alignment and blocking and designs a multi-path double-transformation alignment and segmentation network structure, DTASN. The sampling strategy of each training batch is as follows: randomly select P pedestrians from the training data set, then randomly select K visible-light pedestrian images and K infrared pedestrian images for each pedestrian to form a batch of 2PK pedestrian images, and finally feed the 2PK pedestrian images into the network for training; a minimal sketch of this sampling strategy is given below. Under the supervision of label information, the self-learning capacity of the convolutional neural network is used to adaptively align and correct severely misaligned visible-light and infrared images, and the aligned images are horizontally segmented into local blocks, thereby improving cross-modal pedestrian re-identification accuracy.
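As a concrete illustration of this batch construction, the following is a minimal sketch (not taken from the patent; the function name and dataset interface are assumed for illustration) of drawing P identities with K visible-light and K infrared images each:

```python
import random

def sample_cross_modal_batch(visible_by_id, infrared_by_id, P=8, K=4):
    """Draw P identities, then K visible-light and K infrared images per identity.

    visible_by_id / infrared_by_id: dict mapping person ID -> list of image paths.
    Returns a list of 2*P*K (path, person_id, modality) tuples.
    """
    ids = random.sample(list(visible_by_id.keys() & infrared_by_id.keys()), P)
    batch = []
    for pid in ids:
        vis = random.choices(visible_by_id[pid], k=K)  # sample with replacement if fewer than K images
        ir = random.choices(infrared_by_id[pid], k=K)
        batch += [(p, pid, "visible") for p in vis]
        batch += [(p, pid, "infrared") for p in ir]
    return batch  # 2*P*K samples per training batch
```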
A cross-mode pedestrian re-identification method based on double-transformation alignment and blocking comprises the following steps:
(1) method for extracting visible light pedestrian image by using visible light-based branch network
Figure BDA0002632290710000027
Is characterized by obtaining
Figure BDA00026322907100000210
Infrared pedestrian image extraction method using infrared-based branch network
Figure BDA0002632290710000028
Is characterized by obtaining
Figure BDA0002632290710000029
(2) Taking out the characteristics of a fifth residual block (conv _5x) from the visible light base branch network, inputting the characteristics into a grid network of a visible light image space transformation module, and linearly regressing a group of affine transformation parameters
Figure BDA0002632290710000021
And generating a visible light image transformation grid, and then generating a new visible light pedestrian image through a bilinear sampler
Figure BDA00026322907100000212
Then to
Figure BDA00026322907100000211
Carrying out feature extraction to obtain the global features of the visible light pedestrians after transformation
Figure BDA0002632290710000022
(3) Taking out the characteristics of a fifth residual block (conv _5x) from the infrared base branch network, inputting the characteristics into a grid network of an infrared image space transformation module, and linearly regressing a group of affine transformation parameters
Figure BDA0002632290710000023
And generating an infrared image transformation grid, and then generating a new infrared pedestrian alignment image through a bilinear sampler
Figure BDA0002632290710000024
Then to
Figure BDA0002632290710000025
To carry outExtracting the features to obtain global features
Figure BDA0002632290710000026
(4) New visible light pedestrian image
Figure BDA00026322907100000318
Horizontally cutting into an upper non-overlapping block, a middle non-overlapping block and a lower non-overlapping block; then extracting the characteristics of the three blocks respectively to obtain the characteristics
Figure BDA0002632290710000031
And
Figure BDA0002632290710000032
finally, the global features of the image are aligned
Figure BDA0002632290710000033
Summing the three image characteristics to obtain the total characteristics of the visible light conversion alignment and segmentation network
Figure BDA0002632290710000034
(5) New infrared pedestrian image
Figure BDA0002632290710000035
Horizontally cutting into an upper non-overlapping block, a middle non-overlapping block and a lower non-overlapping block; then extracting the characteristics of the three blocks respectively to obtain the characteristics
Figure BDA0002632290710000036
And
Figure BDA0002632290710000037
finally, the global features of the image are aligned
Figure BDA0002632290710000038
Summing the three image characteristics to obtain the total characteristics of the infrared conversion alignment and segmentation network
Figure BDA0002632290710000039
(6) Will be provided with
Figure BDA00026322907100000310
Features extracted from visible light basic branch network
Figure BDA00026322907100000311
Performing weighted addition fusion to obtain the total characteristics of visible light branch
Figure BDA00026322907100000312
Will be provided with
Figure BDA00026322907100000313
Features extracted from infrared basic branch network
Figure BDA00026322907100000314
Carrying out weighted addition fusion to obtain the total characteristics of the infrared branches
Figure BDA00026322907100000315
Then the characteristics of the visible light image
Figure BDA00026322907100000316
And features of infrared images
Figure BDA00026322907100000317
And mapping the data to the same characteristic embedding space, and training by combining an identity loss function and a most difficult batch sampling loss function with weight, thereby finally improving the cross-modal pedestrian re-identification precision.
Drawings
FIG. 1 is a block diagram of the cross-modal pedestrian re-identification method based on double-transformation alignment and blocking according to the present invention;
FIG. 2 is a diagram of the visible-light transformation alignment and blocking branch of the present invention;
FIG. 3 is a diagram of the infrared transformation alignment and blocking branch of the present invention.
Detailed Description
The invention will be further described with reference to figures 1, 2 and 3:
the network structure and the principle of the DTASN model are as follows:
the network model framework learns feature representations and distance metrics in an end-to-end manner through a multipath double-aligned and block network while maintaining high resolvability. The frame includes three components: (1) the device comprises a feature extraction module, (2) a feature embedding module and (3) a loss calculation module. The backbone network junctions of all paths are the adopted deep residual network ResNet 50. Due to the lack of available data, the present invention initializes the network using a pre-trained ResNet50 model in order to speed up the convergence of the training process. To enhance the attention to the local features, the present invention applies a location attention module on each path.
For visible-infrared cross-modal pedestrian re-identification, the two modalities are similar in the achromatic information of pedestrian contours and textures, and differ significantly in the imaging spectrum. The invention therefore designs a twin network model to extract visual features from the infrared and visible-light pedestrian images. As shown in FIG. 1, two networks with the same structure are used to extract the feature representations of the visible-light and infrared images; note that their weights are not shared. The feature extraction module consists of two main networks that process visible-light and infrared data respectively: a base branch network and an alignment-and-segmentation network.
(1) Base branch network:
It consists of two identical sub-networks whose weights are not shared; the backbone of each is ResNet-50. All input images are three-channel images of height 288 and width 144. Denote the input images of the visible-light and infrared base branch networks by x^V and x^I respectively, and denote the base branch feature extractor by φ(·). Then φ(x^V) is the deep feature of the visible-light image x^V extracted by the visible-light base branch network, and φ(x^I) is the deep feature of the infrared image x^I extracted by the infrared base branch network; all output feature vectors have length 2048.
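A minimal sketch of this two-stream, non-weight-sharing extractor follows, assuming torchvision's ResNet-50 as the backbone; the class name, the use of ImageNet pre-trained weights and the pooled 2048-d readout follow common practice rather than details stated in the patent beyond the 2048-d output and the 288 × 144 input size:

```python
import torch.nn as nn
from torchvision import models

class BaseBranches(nn.Module):
    """Two structurally identical ResNet-50 backbones; weights are NOT shared."""
    def __init__(self):
        super().__init__()
        self.visible = self._make_backbone()
        self.infrared = self._make_backbone()

    @staticmethod
    def _make_backbone():
        net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        net.fc = nn.Identity()  # keep the 2048-d pooled feature instead of ImageNet logits
        return net

    def forward(self, x_vis, x_ir):
        # x_vis, x_ir: (batch, 3, 288, 144) visible-light / infrared pedestrian images
        f_vis = self.visible(x_vis)   # phi(x^V): (batch, 2048)
        f_ir = self.infrared(x_ir)    # phi(x^I): (batch, 2048)
        return f_vis, f_ir
```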
(2) Spatial transformation module
Principle of visible-light and infrared transformation alignment: the fifth-residual-block feature conv_5x of the visible-light and infrared base branches is used to linearly regress a set of affine transformation parameters θ^V and θ^I. The coordinate correspondence between the images before and after the affine transformation is then established by formula (1):

(x_i^s, y_i^s)^T = A_θ (x_i^t, y_i^t, 1)^T,  with A_θ = [θ11 θ12 θ13; θ21 θ22 θ23]    (1)

where (x_i^t, y_i^t) is the i-th target coordinate in the regular grid of the target image, (x_i^s, y_i^s) is the source coordinate of the sampling point in the input image, and A_θ is the affine transformation matrix, in which θ13 and θ23 control the translation of the transformed image and θ11, θ12, θ21 and θ22 control its scaling and rotation. Bilinear sampling is used to sample the image grid during the affine transformation. With x^V and x^I as the input images of the bilinear sampler, let the new visible-light and infrared images output by the spatial transformation be x_a^V and x_a^I; their correspondence is:

x_a^V(c, m, n) = Σ_{h=1..H} Σ_{w=1..W} x^V(c, h, w) · max(0, 1 - |x_{m,n}^s - w|) · max(0, 1 - |y_{m,n}^s - h|)
x_a^I(c, m, n) = Σ_{h=1..H} Σ_{w=1..W} x^I(c, h, w) · max(0, 1 - |x_{m,n}^s - w|) · max(0, 1 - |y_{m,n}^s - h|)

where x_a^V(c, m, n) and x_a^I(c, m, n) denote the pixel value at coordinate (m, n) in channel c of the target image, x^V(c, h, w) and x^I(c, h, w) denote the pixel value at coordinate (h, w) in channel c of the source image, (x_{m,n}^s, y_{m,n}^s) is the source coordinate given by formula (1) for target location (m, n), and H and W denote the height and width of the target image (or source image). Bilinear sampling is continuously differentiable, so the equations above are continuously differentiable and allow gradient back-propagation, which makes adaptive pedestrian alignment possible. The global features of the aligned images are denoted f_g^V and f_g^I. In addition, to learn more discriminative features, the invention horizontally divides each transformed image into three non-overlapping fixed blocks.
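A compact sketch of this alignment step follows: it regresses the six affine parameters from the conv_5x feature map and warps the input with a bilinear sampler, for which torch.nn.functional.affine_grid and grid_sample are a natural fit. The localization-head layout and the identity initialization are assumptions, not details given in the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineAligner(nn.Module):
    """Regress theta (2x3) from a conv_5x feature map and warp the input image."""
    def __init__(self, feat_channels=2048):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_channels, 6),
        )
        # initialise to the identity transform so training starts from "no warp"
        nn.init.zeros_(self.loc[-1].weight)
        self.loc[-1].bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])

    def forward(self, image, conv5x_feat):
        theta = self.loc(conv5x_feat).view(-1, 2, 3)              # affine parameters
        grid = F.affine_grid(theta, image.size(), align_corners=False)
        aligned = F.grid_sample(image, grid, mode='bilinear',     # bilinear sampler
                                align_corners=False)
        return aligned, theta
```

Applied once to the visible-light path and once to the infrared path, with separate parameters, this corresponds to the double-transformation alignment described above.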
(3) Visible-light transformation alignment and blocking branch
As shown in FIG. 2, the transformation-aligned visible-light image is first horizontally divided into three non-overlapping blocks (upper, middle and lower): the first block covers rows 1-96, the second block rows 97-192 and the third block rows 193-288, and all three blocks are 144 pixels wide. The three block images are then copied to the corresponding positions of three newly defined 288 × 144 sub-images whose pixel values are all initialized to 0. Next, the transformed global feature and the three block sub-image features are extracted through four residual networks respectively, yielding the global feature f_g^V and the block features f_p1^V, f_p2^V and f_p3^V. The invention directly sums the global feature and the three block features to obtain the total feature of the transformed image, f_tas^V = f_g^V + f_p1^V + f_p2^V + f_p3^V. Finally, f_tas^V is fused with the feature φ(x^V) of the visible-light base branch network by weighted addition to obtain the final visible-light image feature f^V, where λ is a predefined trade-off parameter in the interval 0 to 1 that balances the contributions of the two features.
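The slicing, zero-padded sub-image construction and feature fusion described above can be sketched as follows (the infrared branch is symmetric). The four per-image extractors and the exact form of the weighted addition are illustrative assumptions; the patent only states that a trade-off parameter λ in (0, 1) balances the transformed and base features:

```python
import torch

def split_into_padded_blocks(aligned):
    """aligned: (batch, 3, 288, 144) transformation-aligned image.

    Returns three 288x144 images, each containing one horizontal block
    (rows 1-96, 97-192, 193-288) with the remaining rows left at zero.
    """
    blocks = []
    for r0, r1 in [(0, 96), (96, 192), (192, 288)]:
        canvas = torch.zeros_like(aligned)
        canvas[:, :, r0:r1, :] = aligned[:, :, r0:r1, :]
        blocks.append(canvas)
    return blocks

def branch_total_feature(extractors, aligned, base_feat, lam=0.5):
    """extractors: [global_net, block1_net, block2_net, block3_net], each image -> 2048-d feature."""
    f_global = extractors[0](aligned)
    f_blocks = [net(b) for net, b in zip(extractors[1:], split_into_padded_blocks(aligned))]
    f_tas = f_global + sum(f_blocks)              # direct sum of global + three block features
    # One plausible form of the weighted-addition fusion; the patent only specifies
    # that lam in (0, 1) balances the two features.
    return lam * f_tas + (1.0 - lam) * base_feat
```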
(4) Infrared transformation alignment and blocking branch
As shown in FIG. 3, the transformation-aligned infrared image is first horizontally divided into three non-overlapping blocks (upper, middle and lower): the first block covers rows 1-96, the second block rows 97-192 and the third block rows 193-288, and all three blocks are 144 pixels wide. The three block images are then copied to the corresponding positions of three newly defined 288 × 144 sub-images whose pixel values are all initialized to 0. Next, the transformed global feature and the three block sub-image features are extracted through four residual networks respectively, yielding the global feature f_g^I and the block features f_p1^I, f_p2^I and f_p3^I. The invention directly sums the global feature and the three block features to obtain the total feature of the transformed image, f_tas^I = f_g^I + f_p1^I + f_p2^I + f_p3^I. Finally, f_tas^I is fused with the feature φ(x^I) of the infrared base branch network by weighted addition to obtain the final infrared image feature f^I, where λ is a predefined trade-off parameter in the interval 0 to 1 that balances the contributions of the two features.
(5) Feature embedding and loss computation
To reduce the cross-modal difference between the infrared image and the visible-light image, the same embedding function f_θ, which is essentially a fully connected layer with parameters θ, maps the visible-light image feature f^V and the infrared image feature f^I into the same feature space, giving the embedded features f_θ(f^V) and f_θ(f^I), abbreviated z^V and z^I; z^V and z^I are one-dimensional feature vectors of output length 512. For simplicity of presentation, x_{i,j}^V denotes the j-th image of the i-th person in a visible-light image batch, and the same convention is used for an infrared image batch x_{i,j}^I.
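A minimal sketch of this shared embedding, i.e. a single fully connected layer mapping the 2048-d fused branch features into a common 512-d space (variable names are illustrative):

```python
import torch.nn as nn

# Shared embedding f_theta: one fully connected layer mapping the 2048-d
# fused branch feature into the common 512-d embedding space.
embed = nn.Linear(2048, 512)

# z_vis = embed(f_vis_total); z_ir = embed(f_ir_total)
# Both modalities pass through the SAME layer, so their embeddings are directly comparable.
```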
Identity loss function:
suppose that
Figure BDA00026322907100000618
And
Figure BDA00026322907100000619
then the
Figure BDA00026322907100000620
And
Figure BDA00026322907100000621
respectively represent the input pedestrian
Figure BDA00026322907100000622
And
Figure BDA00026322907100000623
the identity prediction probability of (a); for example,
Figure BDA00026322907100000624
representing predictive input visible light images
Figure BDA00026322907100000625
Is the probability of k; use of
Figure BDA00026322907100000626
And
Figure BDA00026322907100000627
input image representing true identity i
Figure BDA00026322907100000628
Of (2), i.e. of
Figure BDA00026322907100000629
And
Figure BDA00026322907100000630
then the identity loss function for predicting identity using cross-entropy loss in a batch is defined as:
Figure BDA00026322907100000631
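Under this cross-entropy formulation, the identity loss over a batch can be sketched as follows; the classifier head and the number of identities are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_ids = 395                           # number of training identities (dataset-dependent; illustrative)
classifier = nn.Linear(512, num_ids)    # shared identity classifier over the 512-d embeddings

def identity_loss(z_vis, z_ir, labels_vis, labels_ir):
    """Cross-entropy identity loss over the 2*P*K embedded features of one batch."""
    logits = torch.cat([classifier(z_vis), classifier(z_ir)], dim=0)
    labels = torch.cat([labels_vis, labels_ir], dim=0)
    return F.cross_entropy(logits, labels)  # averages -log p(true id | image) over 2PK images
```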
weighted most difficult batch sampling loss function:
due to LidOnly the identity of each input sample is considered, and whether the input visible light and infrared belong to the same identity is not emphasized; in order to further relieve the cross-modal difference between the infrared image and the visible light image, the invention uses a single-batch self-adaptive weighted most difficult triple sampling loss function, which is different from TriHard loss, because TriHard loss only considers the information of extreme samples, thus causing extremely large local gradient and network collapse
Figure BDA0002632290710000071
ID identities and
Figure BDA0002632290710000072
same positive sample
Figure BDA0002632290710000073
For the positive sample pair, the larger the Euclidean distance in the nested feature space is, the larger the weight distribution is; in the same way, for
Figure BDA0002632290710000074
The ID and ID can also be calculated in all visible light images of the batch respectively
Figure BDA0002632290710000075
Different negative examples
Figure BDA0002632290710000076
For the negative sample pair, the larger the Euclidean distance in the nested feature space is, the smaller the weight distribution is; it can therefore be seen that different distances (i.e., different degrees of difficulty) are assigned different weights; therefore, the most difficult triple sampling loss function with the weight inherits the advantage of optimizing the relative distance between the positive sample pair and the negative sample pair, avoids introducing any redundant parameter, and enables the triple sampling loss function to be more flexible and strong in adaptability; thus, the anchor point samples are for each visible light image in each batch
Figure BDA0002632290710000077
Weighted least difficult triple sampling loss function
Figure BDA0002632290710000078
Is calculated as
Figure BDA0002632290710000079
Figure BDA00026322907100000710
Figure BDA00026322907100000711
Where p is the corresponding positive sample set and n isNegative set, Wi pIs a positive sample distance weight, Wi nRepresenting the distance weight of the negative sample; similarly, for each infrared image anchor point sample in each batch
Figure BDA00026322907100000712
Weighted least difficult triple sampling loss function
Figure BDA00026322907100000713
The calculation is as follows:
Figure BDA00026322907100000714
Figure BDA00026322907100000715
Figure BDA00026322907100000716
thus, the overall most difficult triplet sampling loss function with weights is:
Figure BDA0002632290710000081
finally, the total loss function is defined as:
L_wrt = L_id + λ L_c_wrt    (14)
where λ is a predefined parameter that balances the contributions of the ID identity loss L_id and the weighted hardest triplet sampling loss L_c_wrt.
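A sketch of the weighted hardest triplet sampling loss as reconstructed above: the softmax-style distance weighting follows the stated rule (larger positive-pair distances receive larger weights, larger negative-pair distances receive smaller weights), but the exact formula is a reconstruction rather than a verbatim copy of the patent's equations:

```python
import torch
import torch.nn.functional as F

def weighted_hardest_triplet_loss(z_anchor, z_gallery, labels_anchor, labels_gallery):
    """Soft-weighted cross-modal triplet loss for one anchor modality against the other.

    z_anchor: (N, 512) embeddings of one modality (e.g. visible), z_gallery: (M, 512) of the other.
    """
    dist = torch.cdist(z_anchor, z_gallery)                          # Euclidean distances (N, M)
    is_pos = labels_anchor.unsqueeze(1) == labels_gallery.unsqueeze(0)
    is_neg = ~is_pos

    # positives: larger distance -> larger weight; negatives: larger distance -> smaller weight
    w_pos = torch.softmax(dist.masked_fill(is_neg, float('-inf')), dim=1)
    w_neg = torch.softmax((-dist).masked_fill(is_pos, float('-inf')), dim=1)

    d_pos = (w_pos * dist).sum(dim=1)       # weighted positive distance per anchor
    d_neg = (w_neg * dist).sum(dim=1)       # weighted negative distance per anchor
    return F.softplus(d_pos - d_neg).mean() # log(1 + exp(d_pos - d_neg)), averaged over anchors

# L_c_wrt = weighted_hardest_triplet_loss(z_vis, z_ir, y_vis, y_ir) \
#         + weighted_hardest_triplet_loss(z_ir, z_vis, y_ir, y_vis)
# L_total = L_id + lam * L_c_wrt   # lam: predefined balance parameter, formula (14)
```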
The invention conducts network-structure ablation studies on the RegDB and SYSU-MM01 data sets, where Baseline denotes the reference network, L_id denotes the identification loss, L_c_wrt denotes the weighted hardest triplet sampling loss, RE denotes random erasing, PA denotes the position attention module PAM, ST denotes the STN spatial transformation network, and HDB denotes horizontal blocking. In addition, the method is compared with several mainstream algorithms; evaluation uses the single-query setting, with Rank-1, Rank-5, Rank-10 and mAP (mean average precision) as evaluation indicators. The experimental results are shown in Tables 1, 2, 3 and 4; accuracy is greatly improved over the reference network and the other compared algorithms.
TABLE 1 Ablation study of the network structure on the RegDB data set
TABLE 2 Ablation study of the network structure on the SYSU-MM01 data set
TABLE 3 Comparison with mainstream algorithms on the RegDB data set
TABLE 4 Comparison with mainstream algorithms on the SYSU-MM01 data set

Claims (6)

1. A cross-mode pedestrian re-identification method based on double-transformation alignment and blocking is characterized by comprising the following steps:
(1) method for extracting visible light pedestrian image by using visible light-based branch network
Figure FDA0002632290700000011
Is characterized by obtaining
Figure FDA0002632290700000012
Infrared pedestrian image extraction method using infrared-based branch network
Figure FDA0002632290700000013
Is characterized by obtaining
Figure FDA0002632290700000014
(2) Taking out the characteristics of a fifth residual block (conv _5x) from the visible light base branch network, inputting the characteristics into a grid network of a visible light image space transformation module, and linearly regressing a group of affine transformation parameters
Figure FDA0002632290700000015
And generating a visible light image transformation grid, and then generating a new visible light pedestrian alignment image through a bilinear sampler
Figure FDA0002632290700000016
Then to
Figure FDA0002632290700000017
Carrying out feature extraction to obtain global features
Figure FDA0002632290700000018
(3) Taking out the characteristics of a fifth residual block (conv _5x) from the infrared base branch network, inputting the characteristics into a grid network of an infrared image space transformation module, and linearly regressing a group of affine transformation parameters
Figure FDA0002632290700000019
And generating an infrared image transformation grid, and then generating a new infrared pedestrian alignment image through a bilinear sampler
Figure FDA00026322907000000110
Then to
Figure FDA00026322907000000111
Carrying out feature extraction to obtain global features
Figure FDA00026322907000000112
(4) New visible light pedestrian image
Figure FDA00026322907000000113
Horizontally cutting into an upper non-overlapping block, a middle non-overlapping block and a lower non-overlapping block; then extracting the characteristics of the three blocks respectively to obtain the characteristics
Figure FDA00026322907000000114
And
Figure FDA00026322907000000115
finally, the global features of the image are aligned
Figure FDA00026322907000000116
Summing the three image characteristics to obtain the total characteristics of the visible light conversion alignment and segmentation network
Figure FDA00026322907000000117
(5) New infrared pedestrian image
Figure FDA00026322907000000118
Horizontally cutting into an upper non-overlapping block, a middle non-overlapping block and a lower non-overlapping block; then extracting the characteristics of the three blocks respectively to obtain the characteristics
Figure FDA00026322907000000119
And
Figure FDA00026322907000000120
finally, the global features of the image are aligned
Figure FDA00026322907000000121
Summing the three image characteristics to obtain the total characteristics of the infrared conversion alignment and segmentation network
Figure FDA00026322907000000122
(6) Will be provided with
Figure FDA00026322907000000123
Features extracted from visible light basic branch network
Figure FDA00026322907000000124
Performing weighted addition fusion to obtain the total characteristics of visible light branch
Figure FDA00026322907000000125
Will be provided with
Figure FDA00026322907000000126
Features extracted from infrared basic branch network
Figure FDA00026322907000000127
Carrying out weighted addition fusion to obtain the total characteristics of the infrared branches
Figure FDA00026322907000000128
Then the characteristics of the visible light image
Figure FDA00026322907000000129
And features of infrared images
Figure FDA00026322907000000130
And mapping the data to the same characteristic embedding space, and training by combining an identity loss function and a most difficult batch sampling loss function with weight, thereby finally improving the cross-modal pedestrian re-identification precision.
2. The method of claim 1, wherein the sampling strategy of each training batch in step (1) is: randomly selecting P pedestrians from the training data set, then randomly selecting K visible-light pedestrian images and K infrared pedestrian images for each pedestrian to form a training batch of 2PK pedestrian images, and finally feeding the 2PK pedestrian images into the network for training; the visible-light base branch network extracts the deep feature φ(x^V) of the visible-light image x^V, and the infrared base branch network extracts the deep feature φ(x^I) of the infrared image x^I; all output feature vectors have length 2048.
3. The method according to claim 1, wherein the transformation alignment in steps (2) and (3) uses the fifth residual block conv_5x extracted from the visible-light base branch (respectively, the infrared base branch) to linearly regress a set of affine transformation parameters θ^V and θ^I; the coordinate correspondence between the images before and after the affine transformation is then established by formula (1):

(x_i^s, y_i^s)^T = A_θ (x_i^t, y_i^t, 1)^T,  with A_θ = [θ11 θ12 θ13; θ21 θ22 θ23]    (1)

where (x_i^t, y_i^t) is the i-th target coordinate in the regular grid of the target image, (x_i^s, y_i^s) is the source coordinate of the sampling point in the input image, and A_θ is the affine transformation matrix, in which θ13 and θ23 control the translation of the transformed image and θ11, θ12, θ21 and θ22 control its scaling and rotation; bilinear sampling is used to sample the image grid during the affine transformation; with x^V and x^I as the input images of the bilinear sampler, the new visible-light and infrared images output by the spatial transformation, denoted x_a^V and x_a^I, satisfy:

x_a^V(c, m, n) = Σ_{h=1..H} Σ_{w=1..W} x^V(c, h, w) · max(0, 1 - |x_{m,n}^s - w|) · max(0, 1 - |y_{m,n}^s - h|)
x_a^I(c, m, n) = Σ_{h=1..H} Σ_{w=1..W} x^I(c, h, w) · max(0, 1 - |x_{m,n}^s - w|) · max(0, 1 - |y_{m,n}^s - h|)

where x_a^V(c, m, n) and x_a^I(c, m, n) denote the pixel value at coordinate (m, n) in channel c of the target image, x^V(c, h, w) and x^I(c, h, w) denote the pixel value at coordinate (h, w) in channel c of the source image, (x_{m,n}^s, y_{m,n}^s) is the source coordinate given by formula (1) for target location (m, n), and H and W denote the height and width of the target image (or source image); bilinear sampling is continuously differentiable, so the above equations are continuously differentiable and allow gradient back-propagation, enabling adaptive pedestrian alignment; the global features of the aligned images are denoted f_g^V and f_g^I; in addition, in order to learn more discriminative features, the transformed image is horizontally divided into three non-overlapping fixed blocks.
4. The method according to claim 1, wherein in step (4) the transformation-aligned image is first horizontally sliced into upper, middle and lower blocks, the first block covering rows 1-96, the second rows 97-192 and the third rows 193-288, with all three blocks 144 pixels wide; the three block images are then copied to the corresponding positions of three newly defined 288 × 144 sub-images whose pixel values are all initialized to 0; next, the transformed global feature and the three block sub-image features are extracted through four ResNet-50 residual networks respectively, yielding the global feature f_g^V and the block features f_p1^V, f_p2^V and f_p3^V; the global feature and the three block sub-image features are directly summed to obtain the total feature of the transformed image, f_tas^V = f_g^V + f_p1^V + f_p2^V + f_p3^V; finally, f_tas^V is fused with the original-image feature φ(x^V) of step (1) by weighted addition to obtain the final visible-light image feature f^V, where λ is a predefined trade-off parameter in the interval 0 to 1 that balances the contributions of the two features.
5. The method according to claim 1, wherein in step (5) the transformation-aligned image is first horizontally sliced into upper, middle and lower blocks, the first block covering rows 1-96, the second rows 97-192 and the third rows 193-288, with all three blocks 144 pixels wide; the three block images are then copied to the corresponding positions of three newly defined 288 × 144 sub-images whose pixel values are all initialized to 0; next, the transformed global feature and the three block sub-image features are extracted through four ResNet-50 residual networks respectively, yielding the global feature f_g^I and the block features f_p1^I, f_p2^I and f_p3^I; the global feature and the three block sub-image features are directly summed to obtain the total feature of the transformed image, f_tas^I = f_g^I + f_p1^I + f_p2^I + f_p3^I; finally, f_tas^I is fused with the original-image feature φ(x^I) of step (1) by weighted addition to obtain the final infrared image feature f^I, where λ is a predefined trade-off parameter in the interval 0 to 1 that balances the contributions of the two features.
6. The method according to claim 1, wherein in step (6), in order to reduce the cross-modal difference between the infrared image and the visible-light image, the same embedding function f_θ, which is essentially a fully connected layer with parameters θ, maps the visible-light image feature f^V and the infrared image feature f^I into the same feature space to obtain the embedded features f_θ(f^V) and f_θ(f^I), abbreviated z^V and z^I, which are one-dimensional feature vectors of output length 512; for simplicity of presentation, x_{i,j}^V denotes the j-th image of the i-th person in a visible-light image batch, and the same convention is used for an infrared image batch x_{i,j}^I; let z_{i,j}^V and z_{i,j}^I be the embedded features of x_{i,j}^V and x_{i,j}^I; then p(k | x_{i,j}^V) and p(k | x_{i,j}^I) denote the identity prediction probabilities of the input pedestrians x_{i,j}^V and x_{i,j}^I; for example, p(k | x_{i,j}^V) is the predicted probability that the input visible-light image x_{i,j}^V has identity k; p(i | x_{i,j}^V) and p(i | x_{i,j}^I) denote the predicted probabilities of the true identity i of the input images; the identity loss function that predicts identity with a cross-entropy loss over a batch is then defined as:

L_id = - (1 / (2PK)) Σ_{i=1..P} Σ_{j=1..K} [ log p(i | x_{i,j}^V) + log p(i | x_{i,j}^I) ]

since L_id only considers the identity of each input sample and does not emphasize whether the input visible-light and infrared images belong to the same identity, and since the TriHard loss (hardest triplet sampling loss) considers only the most extreme samples and can therefore produce extremely large local gradients and cause the network to collapse, the invention, unlike TriHard loss, uses a single-batch adaptively weighted hardest triplet sampling loss to further alleviate the cross-modal difference between the infrared and visible-light images; the core idea is that, for each infrared image sample x_{i,j}^I in a batch, every visible-light sample of the batch with the same ID forms a positive pair with it, and the larger the Euclidean distance of a positive pair in the embedded feature space, the larger the weight assigned to it; likewise, every visible-light sample of the batch with a different ID forms a negative pair with it, and the larger the Euclidean distance of a negative pair in the embedded feature space, the smaller the weight assigned to it; different distances (different degrees of difficulty) are thus assigned different weights; the weighted hardest triplet sampling loss therefore inherits the advantage of optimizing the relative distance between positive and negative pairs while avoiding the introduction of any redundant parameters, making it more flexible and adaptable; for each visible-light anchor sample z_{i,j}^V in a batch, the weighted hardest triplet sampling loss L_wrt^V is calculated as:

L_wrt^V = Σ_a log(1 + exp( Σ_{p∈P(a)} W_a^p · d(a, p) - Σ_{n∈N(a)} W_a^n · d(a, n) ))
W_a^p = exp(d(a, p)) / Σ_{p'∈P(a)} exp(d(a, p')),   W_a^n = exp(-d(a, n)) / Σ_{n'∈N(a)} exp(-d(a, n'))

where the sum over a runs over all visible-light anchor samples of the batch, d(·,·) is the Euclidean distance in the embedded feature space, P(a) is the corresponding positive sample set, N(a) is the negative sample set, W_a^p is the positive-pair distance weight and W_a^n is the negative-pair distance weight; similarly, for each infrared anchor sample z_{i,j}^I in a batch, the weighted hardest triplet sampling loss L_wrt^I is calculated in the same way; the overall weighted hardest triplet sampling loss is therefore:

L_c_wrt = L_wrt^V + L_wrt^I

finally, the total loss function is defined as:

L_wrt = L_id + λ L_c_wrt    (11)

where λ is a predefined parameter that balances the contributions of the ID identity loss L_id and the weighted hardest triplet sampling loss L_c_wrt.
CN202010814790.2A 2020-08-13 2020-08-13 Cross-mode pedestrian re-identification method based on double-transformation alignment and blocking Pending CN113761995A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010814790.2A CN113761995A (en) 2020-08-13 2020-08-13 Cross-mode pedestrian re-identification method based on double-transformation alignment and blocking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010814790.2A CN113761995A (en) 2020-08-13 2020-08-13 Cross-mode pedestrian re-identification method based on double-transformation alignment and blocking

Publications (1)

Publication Number Publication Date
CN113761995A true CN113761995A (en) 2021-12-07

Family

ID=78785620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010814790.2A Pending CN113761995A (en) 2020-08-13 2020-08-13 Cross-mode pedestrian re-identification method based on double-transformation alignment and blocking

Country Status (1)

Country Link
CN (1) CN113761995A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612937A (en) * 2022-03-15 2022-06-10 西安电子科技大学 Single-mode enhancement-based infrared and visible light fusion pedestrian detection method
CN116071369A (en) * 2022-12-13 2023-05-05 哈尔滨理工大学 Infrared image processing method and device
WO2023231233A1 (en) * 2022-05-31 2023-12-07 浪潮电子信息产业股份有限公司 Cross-modal target re-identification method and apparatus, device, and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480178A (en) * 2017-07-01 2017-12-15 广州深域信息科技有限公司 A kind of pedestrian's recognition methods again compared based on image and video cross-module state
US10176405B1 (en) * 2018-06-18 2019-01-08 Inception Institute Of Artificial Intelligence Vehicle re-identification techniques using neural networks for image analysis, viewpoint-aware pattern recognition, and generation of multi- view vehicle representations
CN111325115A (en) * 2020-02-05 2020-06-23 山东师范大学 Countermeasures cross-modal pedestrian re-identification method and system with triple constraint loss

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480178A (en) * 2017-07-01 2017-12-15 广州深域信息科技有限公司 A kind of pedestrian's recognition methods again compared based on image and video cross-module state
US10176405B1 (en) * 2018-06-18 2019-01-08 Inception Institute Of Artificial Intelligence Vehicle re-identification techniques using neural networks for image analysis, viewpoint-aware pattern recognition, and generation of multi- view vehicle representations
CN111325115A (en) * 2020-02-05 2020-06-23 山东师范大学 Countermeasures cross-modal pedestrian re-identification method and system with triple constraint loss

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BO LI ET AL.: "Visible Infrared Cross-Modality Person Re-Identification Network Based on Adaptive Pedestrian Alignment" *
MANG YE ET AL.: "Deep Learning for Person Re-identification: A Survey and Outlook" *
MANG YE ET AL.: "Visible Thermal Person Re-Identification via Dual-Constrained Top-Ranking" *
LUO HAO ET AL.: "Research progress of person re-identification based on deep learning" *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612937A (en) * 2022-03-15 2022-06-10 西安电子科技大学 Single-mode enhancement-based infrared and visible light fusion pedestrian detection method
WO2023231233A1 (en) * 2022-05-31 2023-12-07 浪潮电子信息产业股份有限公司 Cross-modal target re-identification method and apparatus, device, and medium
CN116071369A (en) * 2022-12-13 2023-05-05 哈尔滨理工大学 Infrared image processing method and device
CN116071369B (en) * 2022-12-13 2023-07-14 哈尔滨理工大学 Infrared image processing method and device

Similar Documents

Publication Publication Date Title
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN110728263B (en) Pedestrian re-recognition method based on strong discrimination feature learning of distance selection
CN107832672B (en) Pedestrian re-identification method for designing multi-loss function by utilizing attitude information
CN112651262B (en) Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment
CN113761995A (en) Cross-mode pedestrian re-identification method based on double-transformation alignment and blocking
WO2023087636A1 (en) Anomaly detection method and apparatus, and electronic device, storage medium and computer program product
CN114861761B (en) Loop detection method based on twin network characteristics and geometric verification
CN111709317A (en) Pedestrian re-identification method based on multi-scale features under saliency model
CN115841683A (en) Light-weight pedestrian re-identification method combining multi-level features
Rong et al. Picking point recognition for ripe tomatoes using semantic segmentation and morphological processing
CN116311384A (en) Cross-modal pedestrian re-recognition method and device based on intermediate mode and characterization learning
CN117274627A (en) Multi-temporal snow remote sensing image matching method and system based on image conversion
Chen et al. Self-supervised feature learning for long-term metric visual localization
Zhang et al. Fine-grained-based multi-feature fusion for occluded person re-identification
CN118038494A (en) Cross-modal pedestrian re-identification method for damage scene robustness
Zhang et al. Two-stage domain adaptation for infrared ship target segmentation
Gao et al. Occluded person re-identification based on feature fusion and sparse reconstruction
CN113011359A (en) Method for simultaneously detecting plane structure and generating plane description based on image and application
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
CN116597177A (en) Multi-source image block matching method based on dual-branch parallel depth interaction cooperation
Zhang et al. Depth image based object Localization using binocular camera and dual-stream convolutional neural network
Xi et al. EMA‐GAN: A Generative Adversarial Network for Infrared and Visible Image Fusion with Multiscale Attention Network and Expectation Maximization Algorithm
CN114154576B (en) Feature selection model training method and system based on hybrid supervision
CN112784674B (en) Cross-domain identification method of key personnel search system based on class center self-adaption

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20211207

WD01 Invention patent application deemed withdrawn after publication