CN116052212A - Semi-supervised cross-mode pedestrian re-recognition method based on dual self-supervised learning - Google Patents


Info

Publication number
CN116052212A
CN116052212A (application CN202310027835.5A)
Authority
CN
China
Prior art keywords
pedestrian
self
supervision
image
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310027835.5A
Other languages
Chinese (zh)
Inventor
朱小柯
李允伟
陈小潘
郑明浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University filed Critical Henan University
Priority to CN202310027835.5A priority Critical patent/CN116052212A/en
Publication of CN116052212A publication Critical patent/CN116052212A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/34 - Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning, which comprises the following steps. A: constructing a cross-modal pedestrian re-identification data set; B: performing data enhancement on the pedestrian images in the cross-modal pedestrian re-identification data set; C: constructing the backbone network, the context-based rotation self-supervision network, and the contrast-learning-based self-supervision network of the semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning; D: obtaining the final pedestrian image features, a first probability matrix, and a second probability matrix through the constructed network model; E: performing the self-supervision-based pedestrian re-identification task with the obtained final pedestrian image features, first probability matrix, and second probability matrix, and outputting the final recognition result. The invention can exploit a large amount of unlabeled data to learn the consistency information between images of different modalities and obtain a more comprehensive pedestrian feature representation, thereby performing cross-modal pedestrian re-identification more accurately.

Description

Semi-supervised cross-mode pedestrian re-recognition method based on dual self-supervised learning
Technical Field
The invention relates to pedestrian image recognition methods, and in particular to a semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning.
Background
Pedestrian re-identification (Person Re-identification, person re-ID) is a technique that uses computer vision to determine whether a specific pedestrian is present in an image or video sequence. It is widely regarded as a sub-problem of image retrieval: given a monitored pedestrian image, retrieve images of that pedestrian across devices. In existing studies of pedestrian re-identification, the data sets used for training and testing usually contain only single-modality RGB images; in real scenes, however, images captured by infrared cameras or depth cameras, or described by witnesses, are quite common, so re-identifying pedestrians across the visible-light and infrared modalities is one of the problems to be solved urgently. Cross-modal pedestrian re-identification mainly addresses the following problem: given a visible-light or infrared image of a specific individual, search and match images of the same individual in an image library spanning both modalities.
Currently, cross-modal pedestrian re-identification mainly faces the following challenges:
(1) There are large differences between the images captured in the two modalities. An RGB image has three channels containing red, green, and blue visible-light color information, whereas an infrared image has only one channel containing near-infrared intensity information; the two differ in both wavelength range and imaging principle. Different sharpness and lighting conditions therefore affect the two types of images very differently.
(2) The intra-modality differences found in conventional pedestrian re-identification, such as low resolution, occlusion, and viewpoint variation, are still present in cross-modal pedestrian re-identification.
In addition, although existing methods have made certain progress on cross-modal pedestrian re-identification under extreme degradation conditions, there is still much room for improvement in performance. Most existing methods are trained in a supervised framework, so their performance depends heavily on a large number of labeled training samples. However, labeling enough training samples requires a great deal of manpower and material resources, so the lack of labeled training data severely limits supervised models in practical applications.
Disclosure of Invention
The invention aims to provide a semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning, which can make effective use of a large amount of unlabeled data, learn the consistency information between images of different modalities, and obtain a more comprehensive pedestrian feature representation, thereby performing cross-modal pedestrian re-identification more accurately.
The invention adopts the following technical scheme:
a semi-supervised cross-mode pedestrian re-identification method based on double self-supervised learning comprises the following steps:
a: constructing a cross-mode pedestrian re-identification data set, and preprocessing pedestrian images in the cross-mode pedestrian re-identification data set to obtain an input image with supervision training;
b: performing data enhancement processing on pedestrian images in the cross-mode pedestrian re-identification data set to obtain pedestrian images after the data enhancement processing, wherein the pedestrian images after the data enhancement processing comprise a rotating self-supervision image based on context and a self-supervision image based on contrast;
c: constructing a trunk network and a self-supervision training network of semi-supervision cross-mode pedestrian re-recognition based on double self-supervision learning; the self-supervision training network comprises a context-based rotating self-supervision network and a contrast learning-based self-supervision network; the backbone network, the context-based rotating self-monitoring network and the self-monitoring network based on contrast learning are arranged in parallel and share network weights;
the main network is used for performing supervised learning on the input image with the supervised training to acquire final pedestrian image characteristics; the context-based rotation self-supervision network is used for performing self-supervision learning on the context-based rotation self-supervision image to obtain a first probability matrix for rotation angle prediction; the self-supervision network is used for carrying out self-supervision learning on the self-supervision image based on contrast learning to obtain a second probability matrix for contrast self-supervision learning;
d: b, constructing a training set by utilizing the pedestrian image subjected to the enhancement processing in the step B, wherein the training set comprises a marked sample and an unmarked sample, the marked sample is used for obtaining final pedestrian image characteristics for supervised training through backbone network learning, and the unmarked sample is used for obtaining a first probability matrix for rotation angle prediction and a second probability matrix for contrast self-supervision learning through a context-based rotation self-supervision network and a contrast learning self-supervision network respectively;
e: and D, performing a pedestrian re-recognition task based on self-supervision through a main network and a self-supervision training network based on semi-supervision cross-mode pedestrian re-recognition of double self-supervision learning by using the final pedestrian image characteristics for supervised training, a first probability matrix for rotation angle prediction and a second probability matrix for comparison self-supervision learning, and outputting a final recognition result.
Step A comprises the following specific steps:
A1: constructing the cross-modal pedestrian re-identification data set, acquiring the pedestrian images of its training set, and setting the total number of images input at one time to the network model of the semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning;
A2: resizing the pedestrian images in the cross-modal pedestrian re-identification data set so that their width and height are the same size;
A3: randomly and horizontally flipping the pedestrian images resized in step A2;
A4: padding the pedestrian images randomly flipped in step A3 with pixels;
A5: randomly cropping the pedestrian images padded in step A4;
A6: normalizing the pedestrian images randomly cropped in step A5;
A7: performing random channel erasure on the pedestrian images normalized in step A6 to obtain the input images for supervised training.
Step B comprises the following specific steps:
B1: randomly selecting one angle from the rotation-angle set {0, 90, 180, 270} for each resized pedestrian image, and generating a pseudo label for each rotated pedestrian image to obtain the context-based rotation self-supervision images;
B2: performing random channel erasure on the obtained input images for supervised training;
B3: applying channel exchange to the pedestrian images after channel erasure to obtain the contrast-based self-supervision images.
In step C, the backbone network comprises, in order, a first convolution layer, a first pooling layer, first to third residual layers, a first modal attention layer, a fourth residual layer, a second modal attention layer, and a part-alignment attention layer. The first convolution layer and the first to fourth residual layers extract features from the dimension-reduced pedestrian image features layer by layer and learn the shallow features of the pedestrian image; the first modal attention layer learns the deep features of the pedestrian images in the two modalities; the part-alignment attention layer explores the small gap between the visible-light modality and the infrared modality to obtain the final pedestrian image features.
The first modal attention layer and the second modal attention layer have the same structure, each consisting of two second convolution layers with a convolution kernel size of 1, a ReLU activation function, and a Sigmoid activation function. The modal attention layers compute
F = Ẑ + m_C ⊙ (Z − Ẑ)
where Z denotes the obtained deep features of the pedestrian image, Ẑ denotes the matrix obtained from Z by instance normalization, and m_C is a channel mask representing the identity-related channels, computed as m_C = σ(W_2 δ(W_1 g(Z))); here g(·) denotes the global average pooling layer, δ(·) the ReLU activation function, σ(·) the Sigmoid activation function, and W_1 and W_2 the two fully connected layers of the modal attention layer, followed by the ReLU activation function and the Sigmoid activation function respectively.
The context-based rotation self-supervision network comprises, in order, a third convolution layer, a second pooling layer, a third modal attention layer, a fifth residual layer, a fourth modal attention layer, a global average pooling layer, a BN layer, and a first fully connected layer; the third modal attention layer and the fourth modal attention layer have the same structure as the first modal attention layer.
The contrast-learning-based self-supervision network adds a second fully connected layer to the backbone network, placed after the part-alignment attention layer.
In step E, the final pedestrian image features of the input images for supervised training are updated by back propagation through a first cross-entropy loss function and a center loss function.
The first cross-entropy loss L_ce is computed as
L_ce = −(1/n) Σ_{i=1}^{n} y_i^v log P(C(f_i^v)) − (1/m) Σ_{j=1}^{m} y_j^r log P(C(f_j^r))
where n and m denote the numbers of visible-light and infrared images in the current batch respectively, f_v and f_r denote the pedestrian image features of the visible-light and infrared modalities, y^v and y^r denote the image labels corresponding to f_v and f_r, C(f_v) and C(f_r) denote the probability matrices obtained by passing the features of the two modalities through classifiers with parameters θ, and P(·) is the softmax function.
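As a minimal sketch, the first cross-entropy loss above can be written in numpy; the function names and the use of raw logit arrays in place of the network's classifier outputs C(f_v), C(f_r) are assumptions for illustration:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax: the P(.) in the formula above.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def identity_cross_entropy(logits_v, labels_v, logits_r, labels_r):
    """Mean negative log-likelihood of the classifier outputs,
    averaged per modality (visible-light and infrared) and summed."""
    p_v = softmax(logits_v)
    p_r = softmax(logits_r)
    nll_v = -np.log(p_v[np.arange(len(labels_v)), labels_v])
    nll_r = -np.log(p_r[np.arange(len(labels_r)), labels_r])
    return nll_v.mean() + nll_r.mean()
```

With confident, correct logits the loss approaches zero; with uniform logits it equals log of the class count per modality.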
center loss function
Figure BDA0004045364000000051
The calculation formula of (2) is as follows:
Figure BDA0004045364000000052
wherein ,fi Representing the characteristics of the image of the pedestrian,
Figure BDA0004045364000000053
indicating that the current batch label is y i Mean value of features of>
Figure BDA0004045364000000054
Indicating that the current batch label is y k Mean value of features of>
Figure BDA0004045364000000055
Indicating that the current batch label is y j T is the number of pedestrians in the current lot and ρ is the minimum spacing between all centers.
In step E, the rotation angle is judged by a second cross-entropy loss function in the context-based rotation self-supervision network, and finally the backbone network outputs the mean-average-precision result of pedestrian retrieval, which is used to evaluate the accuracy of pedestrian re-identification.
The second cross-entropy loss L_rot is computed as
L_rot = −(1/R) Σ_{i=1}^{R} ỹ_i log P(x̃_i)
where x̃_i denotes an image after random rotation, ỹ_i is the label generated from the image's random rotation angle, P(x̃_i) denotes the predicted rotation-angle probabilities for x̃_i, and R denotes the total number of image samples in one batch.
In step E, for the second probability matrix output by the contrast-learning-based self-supervision network, the KL divergence is used as the consistency constraint loss function:
L_kl = Σ_i p(x_i) log( p(x_i) / q(x_i) )
where p(x_i) denotes the probability matrix obtained by the color-image classifier in supervised learning, and q(x_i) denotes the probability matrix obtained after the contrast-based self-supervision features pass through the classifier.
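The KL-divergence consistency constraint admits a direct numpy sketch; the clipping added for numerical safety and the function name are assumptions:

```python
import numpy as np

def kl_consistency(p, q, eps=1e-12):
    """KL(p || q) summed over the batch: consistency between the
    supervised classifier's probabilities p(x_i) and the contrast
    branch's probabilities q(x_i)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float((p * (np.log(p) - np.log(q))).sum())
```

The loss is zero when the two probability matrices agree and strictly positive otherwise, which is what lets it act as a consistency constraint.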
By adopting the semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning, the invention makes effective use of a large amount of unlabeled data, obtains more comprehensive pedestrian image features from it, and improves the representation-extraction and generalization abilities of the backbone network through the context-based rotation self-supervision network and the contrast-learning-based self-supervision network, thereby performing cross-modal pedestrian re-identification more accurately. Second, the semi-supervised algorithm achieves state-of-the-art predictive performance for supervised and unsupervised image classification without introducing additional hyper-parameters to optimize. Moreover, the semi-supervised algorithm needs no separate pre-training step: it is trained end-to-end in parallel, making it simple, efficient, and practical.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The invention is described in detail below with reference to the accompanying drawings and embodiments.
As shown in FIG. 1, the semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning provided by the invention comprises the following steps:
A: constructing a cross-modal pedestrian re-identification data set, and preprocessing the pedestrian images in it to obtain the input images for supervised training;
In the invention, the cross-modal pedestrian re-identification data sets comprise SYSU (SYSU-MM01) and RegDB, both publicly available pedestrian re-identification data sets. SYSU is a large-scale data set collected by four visible-light cameras and two near-infrared cameras in both indoor and outdoor environments; its training set contains 22,258 visible images and 11,909 infrared images of 395 identities, while the query set and the gallery set contain 3,803 infrared images and 3,010 randomly sampled visible images. RegDB was captured by a pair of aligned cameras (one visible-light camera and one thermal camera) and contains 8,240 images of 412 identities, with 10 images from the visible-light camera and 10 from the thermal camera for each identity.
In the invention, step A comprises the following specific steps:
A1: constructing the cross-modal pedestrian re-identification data set, acquiring the pedestrian images of its training set, and setting the total number of images input at one time to the network model of the semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning to 2 x p x k, where p is the number of pedestrians input per batch and k is the number of images randomly sampled from each pedestrian in a single modality;
In this embodiment, the pedestrian images of the training set may be read into memory using the Python programming language. The hardware of the experimental environment of the invention is an Intel(R) Core(TM) i9-10900K CPU @ 3.70 GHz with 32 GB of memory and an NVIDIA GeForce RTX 3090 GPU; the software platform uses Python 3.8.3 and CUDA 11.1, and the model structure is built with the PyTorch 1.7.0 deep learning framework. The number of pedestrians input to the network model per batch is set to p, and k images are randomly sampled from the images of each pedestrian in each modality, so the total number of images input to the network at one time is 2 x p x k.
A2: the method comprises the steps of performing size adjustment on pedestrian images in a cross-mode pedestrian re-identification data set, and adjusting the width and the height of the pedestrian images to 224 pixels;
since in the rotating self-supervising module, the pedestrian image of this rectangle-like shape changes after rotation, if the height and width of the pedestrian are still set to 256 pixels and 128 pixels, respectively, as is conventional, it will be readily recognized by the model. In the embodiment, the width and height of all the input pedestrian images are set to 224 pixels, the image height-width ratio is 1:1, the external features of the pedestrian images are hardly changed after rotation, the difficulty of training tasks can be effectively increased, the network model is promoted to pay more attention to and extract the detailed features of the pedestrians in the pedestrian images, and therefore the generalization capability and the robustness of the model are improved.
In addition, the training of the main network based on the dual self-supervision learning semi-supervision cross-mode pedestrian re-recognition also uses the pedestrian image with the width and the height set to 224 pixels, so that the shallow characteristics of the pedestrian image learned from the pedestrian image with the size of 224 x 224 by the rotary self-supervision module can be effectively applied to the supervised training of the main network, thereby promoting the faster convergence of the main network and improving the training precision.
A3: and C, randomly and horizontally overturning the pedestrian image with the size adjusted in the step A2 to enhance the generalization capability of the model and relieve the overfitting.
A4: filling 10 pixels into the pedestrian image subjected to random horizontal overturn in the step A3 through a torchvision.transformation.pad () function, wherein the pixel value of each filled pixel is 127;
a5: randomly cutting the pedestrian image filled in the step A4;
in this embodiment, random clipping refers to randomly selecting a rectangular region from a pedestrian image, so that the pedestrian image generates different degrees of occlusion, and correcting the dislocation in the pedestrian image through a spatial transformation network layer in an affine estimation branch. The part with large background is cut, and the missing part of the pedestrian image is filled, so that the phenomenon of network overfitting is reduced, the network generalization capability is improved, and meanwhile, no extra parameter learning or more memory consumption is needed.
A6: and (3) carrying out normalization processing on the pedestrian image subjected to random clipping in the step (A5) so that the preprocessed pedestrian image data is limited in a set range, thereby eliminating adverse effects caused by singular sample data. After the data normalization processing, the speed of gradient descent to solve the optimal solution can be increased, so that the precision is improved.
A7: and C, randomly erasing the pedestrian image subjected to normalization processing in the step A6 through a channel with the probability of 0.5, and finally obtaining an input image with supervision training.
B: performing data enhancement processing on pedestrian images in the cross-mode pedestrian re-identification data set to obtain pedestrian images after the data enhancement processing, wherein the pedestrian images after the data enhancement processing comprise a rotating self-supervision image based on context and a self-supervision image based on contrast;
the step B comprises the following specific steps:
b1: sequentially selecting one angle from the rotation angle set {0,90,180,270} for each pedestrian image with the size adjusted in the step A2 randomly, and correspondingly generating a pseudo tag for each pedestrian image with the rotation angle to obtain a context-based rotation self-supervision image;
in this embodiment, the obtained context-based rotation self-monitoring image is utilized, and in cooperation with the context-based rotation self-monitoring network in the step C, the background feature of the training image can be ignored focusing on the beneficial attribute represented by the pedestrian image feature, so that the meaningful feature in the semantics including the rotation related portion and the irrelevant portion can be effectively learned, and the rotation invariance is incorporated into the self-monitoring network learning framework. The context-based rotation self-monitoring network in step C learns a segmented representation comprising rotation-related and uncorrelated portions and trains the neural network by jointly predicting image rotations and distinguishing individual instances. In the invention, the rotation recognition is decoupled from the instance recognition, so that the rotation prediction can be improved by reducing the influence of the noise of the rotation label, and the recognition instance of the image rotation is not considered, so that the obtained feature has better generalization capability.
B2: c, randomly erasing the channel with the probability of 0.5 on the input image with the supervision training obtained in the step A7;
b3: and B2, using channel exchange for the pedestrian image after the random erasure of the channel in the step B2, and finally obtaining the self-supervision image based on comparison.
In this embodiment, color-independent images are generated uniformly by channel exchange (i.e., randomly swapping the color channels). The three-channel color visible-light images contain abundant pedestrian feature information, and the color information in them benefits visible-infrared matching, so robustness to color variation is continuously improved. Combining the random channel erasure of step B2 with the random cropping of step A5 further enriches the diversity of the data and yields stronger discriminability.
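The channel-exchange augmentation can be sketched as a random permutation of the color channels; the helper name and return convention are assumptions:

```python
import numpy as np

def channel_swap(img, rng):
    """Randomly permute the RGB channels of an (H, W, 3) image to
    build a colour-independent contrast view."""
    perm = rng.permutation(3)
    return img[:, :, perm], perm
```

Because only the channel order changes, the spatial structure of the pedestrian is preserved while the color statistics are scrambled.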
C: constructing a trunk network and a self-supervision training network of semi-supervision cross-mode pedestrian re-recognition based on double self-supervision learning; the self-supervision training network comprises a context-based rotating self-supervision network and a contrast learning-based self-supervision network; the backbone network, the context-based rotating self-monitoring network and the self-monitoring network based on contrast learning are arranged in parallel and share network weights;
the main network is used for performing supervised learning on the input image with the supervised training obtained in the step A, and obtaining final pedestrian image characteristics with robustness;
the backbone network sequentially comprises a first convolution layer, a first pooling layer, first to third residual layers, a first modal attention layer, a fourth residual layer, a second modal attention layer and a part alignment attention layer; the first convolution layer and the first to fourth residual layers perform feature extraction on the feature of the pedestrian image (including the shallow features and the deep features of the pedestrian image) subjected to dimension reduction layer by layer; the first convolution layer and the first to fourth residual layers are used for learning shallow layer features of the pedestrian image; the first modality attention layer is used for learning deep features of pedestrian images under two modalities, has the same structure as the second modality attention layer, and consists of two second convolution layers with convolution kernel size of 1, a ReLU activation function and a Sigmod activation function; the calculation formulas of the first modality attention layer and the second modality attention layer are as follows:
Figure BDA0004045364000000091
wherein Z represents the depth characteristics of the obtained pedestrian image,
Figure BDA0004045364000000092
representing the matrix of Z after instance normalization, m C For a channel mask, representing identity-related channels, m C The calculation formula of (2) is m C =σ(W 2 δ(W 1 g (Z)); g (·) represents the global average pooling layer, δ (·) represents the ReLU activation function, σ (·) represents the Sigmod activation function, W 1 and W2 Two fully connected layers in the modal attention layer are represented respectively, and are located after the ReLU activation function and the Sigmod activation function.
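A numpy sketch of one modal attention layer follows. The channel-mask formula m_C = σ(W_2 δ(W_1 g(Z))) comes from the text; the way the mask combines Z with its instance-normalised form is an assumption, since that formula appears only as an image in the filing, and the function names are hypothetical:

```python
import numpy as np

def instance_norm(z, eps=1e-5):
    # Per-channel instance normalisation over the spatial dimensions.
    mu = z.mean(axis=(1, 2), keepdims=True)
    var = z.var(axis=(1, 2), keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def modal_attention(z, w1, w2):
    """z: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r) play
    the roles of the two fully connected layers W_1 and W_2.
    ASSUMED combination: F = Z_hat + m_C * (Z - Z_hat)."""
    g = z.mean(axis=(1, 2))                                   # g(Z): GAP
    m = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ g, 0.0)))) # m_C
    z_hat = instance_norm(z)
    return z_hat + m[:, None, None] * (z - z_hat)
```

With m_C near 1 a channel keeps its original (identity-related) statistics; with m_C near 0 it is replaced by its instance-normalised, style-suppressed version.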
The part-alignment attention layer is used for exploiting the fine-grained discrepancy between the two modalities: it divides the global pedestrian image feature into six horizontal blocks, combines the global and local pedestrian image features into one feature vector, and processes it through an adaptive average pooling layer to obtain the final pedestrian image features;
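The part-alignment step described above (six horizontal blocks plus a pooled global feature) can be sketched as follows; the (C, H, W) feature layout and the concatenation order are assumptions for illustration only:

```python
import numpy as np

def part_align_features(feat, num_parts=6):
    """Sketch of part alignment: split a (C, H, W) feature map into
    `num_parts` horizontal stripes, average-pool each stripe, and
    concatenate the pooled global feature with the local stripe features."""
    C, H, W = feat.shape
    assert H % num_parts == 0, "H must be divisible by the number of parts"
    stripe_h = H // num_parts
    # local features: one C-dimensional vector per horizontal stripe
    locals_ = [feat[:, i * stripe_h:(i + 1) * stripe_h, :].mean(axis=(1, 2))
               for i in range(num_parts)]
    # global feature: average pooling over the whole map
    global_ = feat.mean(axis=(1, 2))
    return np.concatenate([global_] + locals_)   # shape (C * (num_parts + 1),)
```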
the context-based rotation self-supervision network is used for performing self-supervision learning on the context-based rotation self-supervision image obtained in the step B1, and finally obtaining a first probability matrix for rotation angle prediction; the context-based rotation self-supervision network sequentially comprises a third convolution layer, a second pooling layer, a third modal attention layer, a fifth residual layer, a fourth modal attention layer, a global average pooling layer, a BN layer and a first full-connection layer, wherein the third modal attention layer and the fourth modal attention layer have the same structure as the first modal attention layer;
In the invention, the context-based rotation self-supervision network applies a set of random geometric transformations to randomly rotate the input context-based rotation self-supervision image. Each randomly rotated self-supervision image corresponds to a pseudo label, and the network is trained to identify the rotation angle: if it cannot capture the deep pedestrian features in the rotated image, it cannot recover that image's rotation angle. The context-based rotation self-supervision network and the contrast-learning-based self-supervision network share the weights of the backbone network, while the rotation network also has an independent output head: the deep pedestrian features of the rotated image are passed through a global average pooling layer, a BN layer and a fully connected layer and mapped to a 4-dimensional vector for classification, finally yielding the first probability matrix for rotation-angle prediction; the 4 dimensions correspond to the labels of the four rotation angles.
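The rotation pretext task described above can be sketched in plain Python: each image is rotated by one of {0, 90, 180, 270} degrees, and the index of the chosen angle becomes the 4-way pseudo label the rotation branch must predict. The grid-of-rows image representation is an assumption for illustration:

```python
import random

ANGLES = (0, 90, 180, 270)  # pseudo-label i corresponds to ANGLES[i]

def rot90(img):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def make_rotation_sample(img, rng=random):
    """Randomly rotate an image and return (rotated_image, pseudo_label).
    The 4-way label is what the rotation self-supervision branch is
    trained to predict with its 4-dimensional output."""
    label = rng.randrange(4)
    out = img
    for _ in range(label):
        out = rot90(out)
    return out, label
```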
The contrast-learning-based self-supervision network is used for performing self-supervised learning on the contrast-based self-supervision image obtained in step B3, finally obtaining a second probability matrix for contrastive self-supervised learning; it is constructed by adding a second fully connected layer for color-image classification on top of the backbone network, the second fully connected layer being located after the part-alignment attention layer.
In the invention, the contrast-learning-based self-supervision network obtains images of the same person with different color effects from a given visible-light image through data enhancement, i.e. channel random erasing and channel exchange over the three channels R, G and B of the visible-light image; after pedestrian image features are extracted by the weight-sharing backbone network, the second fully connected layer produces the second probability matrix for contrastive self-supervised learning.
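The channel random erasing and channel exchange described above can be sketched as follows for an RGB image stored as rows of (R, G, B) tuples; the per-image sampling of the erased channel and of the permutation is an assumption, since the patent does not fix those details here:

```python
import random

def contrast_augment(image, rng=random):
    """Sketch of the contrast-based view: zero one randomly chosen
    channel (channel random erasing), then randomly permute the
    channel order (channel exchange), uniformly over the image.
    `image` is a list of rows of (R, G, B) tuples."""
    erased = rng.randrange(3)          # which channel to erase
    perm = [0, 1, 2]
    rng.shuffle(perm)                  # channel permutation

    def f(px):
        p = list(px)
        p[erased] = 0                  # channel random erasing
        return tuple(p[i] for i in perm)  # channel exchange

    return [[f(px) for px in row] for row in image]
```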
Since the supervised task imposes consistency constraints only on the image classifiers (i.e. the fully connected layers) of different modalities, the backbone network learns only the shallow pedestrian features shared across modalities and ignores those within the same modality. By learning shallow pedestrian features both across and within modalities through the contrast-learning-based self-supervision network, the invention can effectively learn the invariance between the supervised-training input image and its enhanced counterpart.
D: b, constructing a training set by utilizing the pedestrian image subjected to the enhancement processing in the step B, wherein the training set comprises a marked sample and an unmarked sample, the marked sample is used for obtaining final pedestrian image characteristics for supervised training through backbone network learning, and the unmarked sample is used for obtaining a first probability matrix for rotation angle prediction and a second probability matrix for contrast self-supervision learning through a context-based rotation self-supervision network and a contrast learning self-supervision network respectively;
E: using the final pedestrian image features for supervised training, the first probability matrix for rotation-angle prediction and the second probability matrix for contrastive self-supervised learning obtained in step D, performing the self-supervision-based pedestrian re-recognition task through the backbone network and the self-supervision training network of the dual-self-supervised semi-supervised cross-modal pedestrian re-recognition model;
in the step E, the final pedestrian image features of the supervised-training input image are updated by back propagation through a preset first cross-entropy loss function and a center loss function;
for the final pedestrian image features of the supervised-training input image, supervised training is performed through the first cross-entropy loss function and the center loss function. The cross-entropy loss L_id is computed as:

L_id = -(1/n) Σ_{i=1..n} y_i^v · log P(C(f_i^v)) - (1/m) Σ_{i=1..m} y_i^r · log P(C(f_i^r))

where n and m respectively denote the numbers of visible-light-modality and infrared-modality images in the current batch, f^v and f^r respectively denote the pedestrian image features of the visible-light and infrared modalities, y_i^v and y_i^r respectively denote the image labels corresponding to f^v and f^r, C(f^v) and C(f^r) respectively denote the probability matrices obtained by passing the two modalities' pedestrian image features through the classifier with parameters θ, and P(·) is the softmax (normalized exponential) function;
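A minimal NumPy sketch of this two-modality identity loss, assuming row-wise logits and integer labels (the classifier C(·) itself is omitted and its outputs are taken as given):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def identity_ce_loss(logits_v, labels_v, logits_r, labels_r):
    """Sketch of the supervised identity loss: mean cross-entropy over
    the visible-light batch plus mean cross-entropy over the infrared
    batch, each computed against its own labels."""
    pv = softmax(logits_v)   # P(C(f_v))
    pr = softmax(logits_r)   # P(C(f_r))
    lv = -np.log(pv[np.arange(len(labels_v)), labels_v]).mean()
    lr = -np.log(pr[np.arange(len(labels_r)), labels_r]).mean()
    return lv + lr
```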
the center loss function L_ct is computed as:

L_ct = Σ_i ||f_i - c_{y_i}||_2^2 + Σ_{k=1..T} Σ_{j=k+1..T} max(ρ - ||c_{y_k} - c_{y_j}||_2, 0)

where f_i denotes a pedestrian image feature, c_{y_i} denotes the mean of the features whose label in the current batch is y_i, c_{y_k} and c_{y_j} denote the mean features of current-batch labels y_k and y_j, T is the number of pedestrians in the current batch, and ρ is the minimum spacing enforced between all centers;
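A sketch of this center loss, with batch centers computed as per-label feature means and a hinge enforcing the minimum spacing ρ between centers; the exact weighting of the two terms is an assumption:

```python
import numpy as np

def center_loss(feats, labels, rho=1.0):
    """Sketch of the center loss: pull each feature toward its
    identity's batch center, and push distinct centers at least
    rho apart via a hinge on pairwise center distances."""
    ids = np.unique(labels)
    centers = {y: feats[labels == y].mean(axis=0) for y in ids}
    # intra-class term: squared distance of each feature to its center
    pull = sum(np.sum((f - centers[y]) ** 2) for f, y in zip(feats, labels))
    # inter-center term: hinge on pairwise distances between centers
    push = 0.0
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            d = np.linalg.norm(centers[ids[a]] - centers[ids[b]])
            push += max(0.0, rho - d)
    return pull + push
```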
the context-based rotation self-supervision network judges the rotation angle through a second cross-entropy loss function; finally, the backbone network outputs an average precision result, used for evaluating the accuracy of pedestrian re-identification;
in the invention, the context-based rotation self-supervision network computes the second cross-entropy loss L_rot over the first probability matrix it outputs, with the formula:

L_rot = -(1/R) Σ_{i=1..R} ỹ_i · log P(C(x̃_i))

where x̃_i denotes an image after random rotation, ỹ_i denotes the label generated by the random rotation angle of the image, and R denotes the total number of image samples in one batch;
for the second probability matrix output by the contrast-learning-based self-supervision network, the KL divergence is used as the consistency-constraint loss function:

L_kl = Σ_i p(x_i) · log( p(x_i) / q(x_i) )

where p(x_i) denotes the probability matrix obtained by the color-image classifier in supervised learning, and q(x_i) denotes the probability matrix obtained after the contrast-based self-supervision features pass through the classifier.
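A sketch of this KL consistency term over per-sample probability rows, with a small epsilon added for numerical safety (an implementation detail not specified in the patent):

```python
import math

def kl_consistency(p_rows, q_rows, eps=1e-12):
    """Sketch of the KL consistency constraint: sum over samples of
    KL(p || q), where p is the supervised classifier's probability
    row and q the contrast branch's row for the same sample."""
    total = 0.0
    for p, q in zip(p_rows, q_rows):
        total += sum(pi * math.log((pi + eps) / (qi + eps))
                     for pi, qi in zip(p, q))
    return total
```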
In the invention, the mean average precision and Rank-1 (first-match accuracy) obtained by the dual-self-supervised semi-supervised cross-modal pedestrian re-recognition method are improved by 5.9 percentage points (from 80.07% to 85.97%) and 8.27 percentage points (from 82.8% to 91.07%) respectively on the RegDB dataset, and by 1.35 percentage points (from 80.95% to 82.3%) and 1.96 percentage points (from 76.74% to 78.7%) respectively in the indoor scenario of the SYSU dataset. The invention not only successfully applies self-supervised learning to the pedestrian re-recognition field, but also demonstrates on multiple datasets that the method enhances recognition robustness and effectively improves the accuracy of pedestrian re-recognition.

Claims (10)

1. The semi-supervised cross-mode pedestrian re-identification method based on double self-supervised learning is characterized by comprising the following steps of:
a: constructing a cross-mode pedestrian re-identification data set, and preprocessing pedestrian images in the cross-mode pedestrian re-identification data set to obtain an input image with supervision training;
b: performing data enhancement processing on pedestrian images in the cross-mode pedestrian re-identification data set to obtain pedestrian images after the data enhancement processing, wherein the pedestrian images after the data enhancement processing comprise a rotating self-supervision image based on context and a self-supervision image based on contrast;
c: constructing a backbone network and a self-supervision training network for semi-supervised cross-modal pedestrian re-recognition based on dual self-supervised learning; the self-supervision training network comprises a context-based rotation self-supervision network and a contrast-learning-based self-supervision network; the backbone network, the context-based rotation self-supervision network and the contrast-learning-based self-supervision network are arranged in parallel and share network weights;
the main network is used for performing supervised learning on the input image with the supervised training to acquire final pedestrian image characteristics; the context-based rotation self-supervision network is used for performing self-supervision learning on the context-based rotation self-supervision image to obtain a first probability matrix for rotation angle prediction; the self-supervision network is used for carrying out self-supervision learning on the self-supervision image based on contrast learning to obtain a second probability matrix for contrast self-supervision learning;
d: b, constructing a training set by utilizing the pedestrian image subjected to the enhancement processing in the step B, wherein the training set comprises a marked sample and an unmarked sample, the marked sample is used for obtaining final pedestrian image characteristics for supervised training through backbone network learning, and the unmarked sample is used for obtaining a first probability matrix for rotation angle prediction and a second probability matrix for contrast self-supervision learning through a context-based rotation self-supervision network and a contrast learning self-supervision network respectively;
e: and D, performing a pedestrian re-recognition task based on self-supervision through a main network and a self-supervision training network based on semi-supervision cross-mode pedestrian re-recognition of double self-supervision learning by using the final pedestrian image characteristics for supervised training, a first probability matrix for rotation angle prediction and a second probability matrix for comparison self-supervision learning, and outputting a final recognition result.
2. The semi-supervised cross-modal pedestrian re-recognition method based on dual self-supervised learning as set forth in claim 1, wherein the step a includes the following specific steps:
a1: constructing a cross-mode pedestrian re-recognition data set, acquiring pedestrian images in a training set in the cross-mode pedestrian re-recognition data set, and setting the total number of images input by a network model of a semi-supervised cross-mode pedestrian re-recognition method based on double self-supervision learning;
a2: the method comprises the steps of performing size adjustment on pedestrian images in a cross-mode pedestrian re-identification data set, and adjusting the width and the height of the pedestrian images to be the same size;
a3: c, randomly and horizontally overturning the pedestrian image with the size adjusted in the step A2;
a4: c, filling pixels in the pedestrian image subjected to random horizontal overturn in the step A3;
a5: randomly cutting the pedestrian image filled in the step A4;
a6: carrying out normalization processing on the pedestrian image subjected to random clipping in the step A5;
a7: and C, carrying out channel random erasure on the pedestrian image subjected to normalization processing in the step A6 to obtain an input image with supervision training.
3. The semi-supervised cross-modal pedestrian re-recognition method based on dual self-supervised learning as set forth in claim 2, wherein the step B includes the specific steps of:
b1: sequentially selecting one angle from the rotation angle set {0, 90, 180, 270} for each resized pedestrian image, and generating a pseudo label for each rotated pedestrian image, obtaining a context-based rotation self-supervision image;
b2: carrying out channel random erasing on the obtained input image with supervision training;
b3: and using channel exchange for the pedestrian image after the random erasure of the channel is completed, and obtaining a self-supervision image based on contrast.
4. The dual self-supervised learning-based semi-supervised cross-modal pedestrian re-recognition method as set forth in claim 1, wherein: in the step C, the backbone network sequentially comprises a first convolution layer, a first pooling layer, first to third residual layers, a first modal attention layer, a fourth residual layer, a second modal attention layer and a part-alignment attention layer; the first convolution layer and the first to fourth residual layers perform layer-by-layer feature extraction on the dimension-reduced pedestrian image and learn its shallow features; the first modal attention layer is used for learning deep features of pedestrian images in the two modalities; the part-alignment attention layer is used for exploiting the small discrepancy between the visible-light and infrared modalities to obtain the final pedestrian image features.
5. The semi-supervised cross-modal pedestrian re-recognition method based on dual self-supervised learning as set forth in claim 4, wherein: the first modal attention layer and the second modal attention layer have the same structure, each consisting of two second convolution layers with convolution kernel size 1, a ReLU activation function and a Sigmoid activation function; the first and second modal attention layers are computed as follows:

Ẑ = m_C ⊙ Z + (1 − m_C) ⊙ Z̃

where Z denotes the deep features of the pedestrian image, Z̃ denotes the matrix obtained from Z after instance normalization, ⊙ denotes element-wise multiplication, and m_C is a channel mask marking identity-related channels, computed as m_C = σ(W_2 δ(W_1 g(Z))); g(·) denotes the global average pooling layer, δ(·) the ReLU activation function, σ(·) the Sigmoid activation function, and W_1 and W_2 the two fully connected layers in the modal attention layer, followed by the ReLU and Sigmoid activations respectively.
6. The dual self-supervised learning-based semi-supervised cross-modal pedestrian re-recognition method as set forth in claim 1, wherein: the context-based rotation self-supervision network sequentially comprises a third convolution layer, a second pooling layer, a third modal attention layer, a fifth residual error layer, a fourth modal attention layer, a global average pooling layer, a BN layer and a first full-connection layer, wherein the third modal attention layer and the fourth modal attention layer have the same structure as the first modal attention layer.
7. The dual self-supervised learning-based semi-supervised cross-modal pedestrian re-recognition method as set forth in claim 1, wherein: the contrast-learning-based self-supervision network is obtained by adding a second fully connected layer on top of the backbone network, the second fully connected layer being located after the part-alignment attention layer.
8. The dual self-supervised learning-based semi-supervised cross-modal pedestrian re-recognition method as set forth in claim 1, wherein: in the step E, the final pedestrian image characteristics of the input image with the supervision training are updated by using the back propagation through a set first cross entropy loss function and a center loss function;
the cross-entropy loss L_id is computed as:

L_id = -(1/n) Σ_{i=1..n} y_i^v · log P(C(f_i^v)) - (1/m) Σ_{i=1..m} y_i^r · log P(C(f_i^r))

where n and m respectively denote the numbers of visible-light-modality and infrared-modality images in the current batch, f^v and f^r respectively denote the pedestrian image features of the visible-light and infrared modalities, y_i^v and y_i^r respectively denote the image labels corresponding to f^v and f^r, C(f^v) and C(f^r) respectively denote the probability matrices obtained by passing the two modalities' pedestrian image features through the classifier with parameters θ, and P(·) is the softmax function;

the center loss function L_ct is computed as:

L_ct = Σ_i ||f_i - c_{y_i}||_2^2 + Σ_{k=1..T} Σ_{j=k+1..T} max(ρ - ||c_{y_k} - c_{y_j}||_2, 0)

where f_i denotes a pedestrian image feature, c_{y_i} denotes the mean of the features whose current-batch label is y_i, c_{y_k} and c_{y_j} denote the mean features of current-batch labels y_k and y_j, T is the number of pedestrians in the current batch, and ρ is the minimum spacing between all centers.
9. The dual self-supervised learning-based semi-supervised cross-modal pedestrian re-recognition method as set forth in claim 1, wherein: in the step E, the context-based rotation self-supervision network judges the rotation angle through a second cross-entropy loss function, and the backbone network finally outputs an average precision result used for evaluating the accuracy of pedestrian re-identification;
the second cross-entropy loss function L_rot is computed as:

L_rot = -(1/R) Σ_{i=1..R} ỹ_i · log P(C(x̃_i))

where x̃_i denotes an image after random rotation, ỹ_i is the label generated by the random rotation angle of the image, and R denotes the total number of image samples in one batch.
10. The dual self-supervised learning-based semi-supervised cross-modal pedestrian re-recognition method as set forth in claim 1, wherein: in the step E, based on a second probability matrix output by the self-supervision network of the comparison learning, the KL divergence is used as a consistency constraint loss function;
L_kl = Σ_i p(x_i) · log( p(x_i) / q(x_i) )

where p(x_i) denotes the probability matrix obtained by the color-image classifier in supervised learning, and q(x_i) denotes the probability matrix obtained after the contrast-based self-supervision features pass through the classifier.
CN202310027835.5A 2023-01-09 2023-01-09 Semi-supervised cross-mode pedestrian re-recognition method based on dual self-supervised learning Pending CN116052212A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310027835.5A CN116052212A (en) 2023-01-09 2023-01-09 Semi-supervised cross-mode pedestrian re-recognition method based on dual self-supervised learning


Publications (1)

Publication Number Publication Date
CN116052212A true CN116052212A (en) 2023-05-02

Family

ID=86115978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310027835.5A Pending CN116052212A (en) 2023-01-09 2023-01-09 Semi-supervised cross-mode pedestrian re-recognition method based on dual self-supervised learning

Country Status (1)

Country Link
CN (1) CN116052212A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116543268A (en) * 2023-07-04 2023-08-04 西南石油大学 Channel enhancement joint transformation-based countermeasure sample generation method and terminal
CN116612439A (en) * 2023-07-20 2023-08-18 华侨大学 Balancing method for modal domain adaptability and feature authentication and pedestrian re-identification method
CN116824695A (en) * 2023-06-07 2023-09-29 南通大学 Pedestrian re-identification non-local defense method based on feature denoising
CN117351518A (en) * 2023-09-26 2024-01-05 武汉大学 Method and system for identifying unsupervised cross-modal pedestrian based on level difference



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination