CN116052212A - Semi-supervised cross-mode pedestrian re-recognition method based on dual self-supervised learning - Google Patents


Info

Publication number
CN116052212A
CN116052212A (application CN202310027835.5A)
Authority
CN
China
Prior art keywords
pedestrian
self
supervision
image
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310027835.5A
Other languages
Chinese (zh)
Inventor
朱小柯
李允伟
陈小潘
郑明浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University filed Critical Henan University
Priority to CN202310027835.5A priority Critical patent/CN116052212A/en
Publication of CN116052212A publication Critical patent/CN116052212A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/34 - Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning, which comprises the following steps. A: constructing a cross-modal pedestrian re-identification data set; B: performing data enhancement on the pedestrian images in the cross-modal pedestrian re-identification data set; C: constructing the backbone network, the context-based rotation self-supervision network, and the contrast-learning-based self-supervision network of the semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning; D: obtaining the final pedestrian image features, a first probability matrix, and a second probability matrix through the constructed network model; E: performing the self-supervision-based pedestrian re-identification task with the obtained final pedestrian image features, first probability matrix, and second probability matrix, and outputting the final recognition result. The invention can exploit a large amount of unlabeled data to learn the consistency information between images of different modalities and obtain a more comprehensive pedestrian feature representation, thereby performing cross-modal pedestrian re-identification more accurately.

Description

Semi-supervised cross-mode pedestrian re-recognition method based on dual self-supervised learning
Technical Field
The invention relates to pedestrian image recognition methods, and in particular to a semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning.
Background
Pedestrian re-identification (Person Re-identification, person re-ID) is a technique that uses computer vision to determine whether a specific pedestrian is present in an image or video sequence. It is widely regarded as a sub-problem of image retrieval: given a monitored pedestrian image, retrieve images of that pedestrian across devices. In existing studies of pedestrian re-identification, the data sets used for training and testing usually contain only single-modality RGB images; in real scenes, however, images captured by infrared cameras or depth cameras, or described by witnesses, are quite common, so re-identifying pedestrians across the visible-light and infrared modalities is one of the problems to be solved urgently. Cross-modal pedestrian re-identification mainly addresses the following problem: given a visible-light or infrared image of a specific individual, search and match images of the same individual in an image library spanning both modalities.
Currently, cross-modal pedestrian re-identification mainly faces the following challenges:
(1) There are large differences between the images captured in the two modalities. An RGB image has three channels containing red, green, and blue visible-light color information, whereas an infrared image has only one channel containing near-infrared intensity information; the two differ in both wavelength range and imaging principle. Different sharpness and lighting conditions therefore affect the two types of images very differently.
(2) The intra-modality differences found in conventional pedestrian re-identification, such as low resolution, occlusion, and viewpoint variation, are still present in cross-modal pedestrian re-identification.
In addition, although existing methods have made certain progress on cross-modal pedestrian re-identification under extreme degradation conditions, there is still much room for improvement in performance. Most existing methods are trained in a supervised framework, so their performance depends heavily on a large number of labeled training samples. However, labeling enough training samples requires a great deal of manpower and material resources, so the lack of labeled training data severely limits supervised models in practical applications.
Disclosure of Invention
The invention aims to provide a semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning, which can make effective use of a large amount of unlabeled data, learn the consistency information between images of different modalities, and obtain a more comprehensive pedestrian feature representation, thereby performing cross-modal pedestrian re-identification more accurately.
The invention adopts the following technical scheme:
a semi-supervised cross-mode pedestrian re-identification method based on double self-supervised learning comprises the following steps:
a: constructing a cross-mode pedestrian re-identification data set, and preprocessing pedestrian images in the cross-mode pedestrian re-identification data set to obtain an input image with supervision training;
b: performing data enhancement processing on pedestrian images in the cross-mode pedestrian re-identification data set to obtain pedestrian images after the data enhancement processing, wherein the pedestrian images after the data enhancement processing comprise a rotating self-supervision image based on context and a self-supervision image based on contrast;
c: constructing a trunk network and a self-supervision training network of semi-supervision cross-mode pedestrian re-recognition based on double self-supervision learning; the self-supervision training network comprises a context-based rotating self-supervision network and a contrast learning-based self-supervision network; the backbone network, the context-based rotating self-monitoring network and the self-monitoring network based on contrast learning are arranged in parallel and share network weights;
the main network is used for performing supervised learning on the input image with the supervised training to acquire final pedestrian image characteristics; the context-based rotation self-supervision network is used for performing self-supervision learning on the context-based rotation self-supervision image to obtain a first probability matrix for rotation angle prediction; the self-supervision network is used for carrying out self-supervision learning on the self-supervision image based on contrast learning to obtain a second probability matrix for contrast self-supervision learning;
d: b, constructing a training set by utilizing the pedestrian image subjected to the enhancement processing in the step B, wherein the training set comprises a marked sample and an unmarked sample, the marked sample is used for obtaining final pedestrian image characteristics for supervised training through backbone network learning, and the unmarked sample is used for obtaining a first probability matrix for rotation angle prediction and a second probability matrix for contrast self-supervision learning through a context-based rotation self-supervision network and a contrast learning self-supervision network respectively;
e: and D, performing a pedestrian re-recognition task based on self-supervision through a main network and a self-supervision training network based on semi-supervision cross-mode pedestrian re-recognition of double self-supervision learning by using the final pedestrian image characteristics for supervised training, a first probability matrix for rotation angle prediction and a second probability matrix for comparison self-supervision learning, and outputting a final recognition result.
Step A comprises the following specific steps:
A1: constructing the cross-modal pedestrian re-identification data set, acquiring the pedestrian images of its training set, and setting the total number of images input at one time to the network model of the semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning;
A2: resizing the pedestrian images in the cross-modal pedestrian re-identification data set so that their width and height are the same size;
A3: randomly and horizontally flipping the pedestrian images resized in step A2;
A4: padding the pedestrian images randomly flipped in step A3 with pixels;
A5: randomly cropping the pedestrian images padded in step A4;
A6: normalizing the pedestrian images randomly cropped in step A5;
A7: performing random channel erasure on the pedestrian images normalized in step A6 to obtain the input images for supervised training.
Step B comprises the following specific steps:
B1: randomly selecting one angle from the rotation-angle set {0, 90, 180, 270} for each resized pedestrian image, and generating a pseudo label for each rotated pedestrian image to obtain the context-based rotation self-supervision images;
B2: performing random channel erasure on the obtained input images for supervised training;
B3: applying channel exchange to the pedestrian images after channel erasure to obtain the contrast-based self-supervision images.
In step C, the backbone network comprises, in order, a first convolution layer, a first pooling layer, first to third residual layers, a first modal attention layer, a fourth residual layer, a second modal attention layer, and a part-alignment attention layer. The first convolution layer and the first to fourth residual layers extract features from the dimension-reduced pedestrian image features layer by layer and learn the shallow features of the pedestrian image; the first modal attention layer learns the deep features of the pedestrian images in the two modalities; the part-alignment attention layer explores the small gap between the visible-light modality and the infrared modality to obtain the final pedestrian image features.
The first modal attention layer and the second modal attention layer have the same structure, each consisting of two second convolution layers with a convolution kernel size of 1, a ReLU activation function, and a Sigmoid activation function. The modal attention layers compute
F = Ẑ + m_C ⊙ (Z − Ẑ)
where Z denotes the obtained deep features of the pedestrian image, Ẑ denotes the matrix obtained from Z by instance normalization, and m_C is a channel mask representing the identity-related channels, computed as m_C = σ(W_2 δ(W_1 g(Z))); here g(·) denotes the global average pooling layer, δ(·) the ReLU activation function, σ(·) the Sigmoid activation function, and W_1 and W_2 the two fully connected layers of the modal attention layer, followed by the ReLU activation function and the Sigmoid activation function respectively.
The context-based rotation self-supervision network comprises, in order, a third convolution layer, a second pooling layer, a third modal attention layer, a fifth residual layer, a fourth modal attention layer, a global average pooling layer, a BN layer, and a first fully connected layer; the third modal attention layer and the fourth modal attention layer have the same structure as the first modal attention layer.
The contrast-learning-based self-supervision network adds a second fully connected layer to the backbone network, placed after the part-alignment attention layer.
In step E, the final pedestrian image features of the input images for supervised training are updated by back propagation through a first cross-entropy loss function and a center loss function.
The first cross-entropy loss L_ce is computed as
L_ce = −(1/n) Σ_{i=1}^{n} y_i^v log P(C(f_i^v)) − (1/m) Σ_{j=1}^{m} y_j^r log P(C(f_j^r))
where n and m denote the numbers of visible-light and infrared images in the current batch respectively, f_v and f_r denote the pedestrian image features of the visible-light and infrared modalities, y^v and y^r denote the image labels corresponding to f_v and f_r, C(f_v) and C(f_r) denote the probability matrices obtained by passing the features of the two modalities through classifiers with parameters θ, and P(·) is the softmax function.
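As a minimal sketch, the first cross-entropy loss above can be written in numpy; the function names and the use of raw logit arrays in place of the network's classifier outputs C(f_v), C(f_r) are assumptions for illustration:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax: the P(.) in the formula above.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def identity_cross_entropy(logits_v, labels_v, logits_r, labels_r):
    """Mean negative log-likelihood of the classifier outputs,
    averaged per modality (visible-light and infrared) and summed."""
    p_v = softmax(logits_v)
    p_r = softmax(logits_r)
    nll_v = -np.log(p_v[np.arange(len(labels_v)), labels_v])
    nll_r = -np.log(p_r[np.arange(len(labels_r)), labels_r])
    return nll_v.mean() + nll_r.mean()
```

With confident, correct logits the loss approaches zero; with uniform logits it equals log of the class count per modality.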
center loss function
Figure BDA0004045364000000051
The calculation formula of (2) is as follows:
Figure BDA0004045364000000052
wherein ,fi Representing the characteristics of the image of the pedestrian,
Figure BDA0004045364000000053
indicating that the current batch label is y i Mean value of features of>
Figure BDA0004045364000000054
Indicating that the current batch label is y k Mean value of features of>
Figure BDA0004045364000000055
Indicating that the current batch label is y j T is the number of pedestrians in the current lot and ρ is the minimum spacing between all centers.
In step E, the rotation angle is judged by a second cross-entropy loss function in the context-based rotation self-supervision network, and finally the backbone network outputs the mean-average-precision result of pedestrian retrieval, which is used to evaluate the accuracy of pedestrian re-identification.
The second cross-entropy loss L_rot is computed as
L_rot = −(1/R) Σ_{i=1}^{R} ỹ_i log P(x̃_i)
where x̃_i denotes an image after random rotation, ỹ_i is the label generated from the image's random rotation angle, P(x̃_i) denotes the predicted rotation-angle probabilities for x̃_i, and R denotes the total number of image samples in one batch.
In step E, for the second probability matrix output by the contrast-learning-based self-supervision network, the KL divergence is used as the consistency constraint loss function:
L_kl = Σ_i p(x_i) log( p(x_i) / q(x_i) )
where p(x_i) denotes the probability matrix obtained by the color-image classifier in supervised learning, and q(x_i) denotes the probability matrix obtained after the contrast-based self-supervision features pass through the classifier.
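The KL-divergence consistency constraint admits a direct numpy sketch; the clipping added for numerical safety and the function name are assumptions:

```python
import numpy as np

def kl_consistency(p, q, eps=1e-12):
    """KL(p || q) summed over the batch: consistency between the
    supervised classifier's probabilities p(x_i) and the contrast
    branch's probabilities q(x_i)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float((p * (np.log(p) - np.log(q))).sum())
```

The loss is zero when the two probability matrices agree and strictly positive otherwise, which is what lets it act as a consistency constraint.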
By adopting the semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning, the invention makes effective use of a large amount of unlabeled data, obtains more comprehensive pedestrian image features from it, and improves the representation-extraction and generalization abilities of the backbone network through the context-based rotation self-supervision network and the contrast-learning-based self-supervision network, thereby performing cross-modal pedestrian re-identification more accurately. Second, the semi-supervised algorithm achieves state-of-the-art predictive performance for supervised and unsupervised image classification without introducing additional hyper-parameters to optimize. Moreover, the semi-supervised algorithm needs no separate pre-training step: it is trained end-to-end in parallel, making it simple, efficient, and practical.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The invention is described in detail below with reference to the accompanying drawings and embodiments.
As shown in FIG. 1, the semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning provided by the invention comprises the following steps:
A: constructing a cross-modal pedestrian re-identification data set, and preprocessing the pedestrian images in it to obtain the input images for supervised training;
In the invention, the cross-modal pedestrian re-identification data sets comprise SYSU (SYSU-MM01) and RegDB, both publicly available pedestrian re-identification data sets. SYSU is a large-scale data set collected by four visible-light cameras and two near-infrared cameras in both indoor and outdoor environments; its training set contains 22,258 visible images and 11,909 infrared images of 395 identities, while the query set and the gallery set contain 3,803 infrared images and 3,010 randomly sampled visible images. RegDB was captured by a pair of aligned cameras (one visible-light camera and one thermal camera) and contains 8,240 images of 412 identities, with 10 images from the visible-light camera and 10 from the thermal camera for each identity.
In the invention, step A comprises the following specific steps:
A1: constructing the cross-modal pedestrian re-identification data set, acquiring the pedestrian images of its training set, and setting the total number of images input at one time to the network model of the semi-supervised cross-modal pedestrian re-identification method based on dual self-supervised learning to 2 x p x k, where p is the number of pedestrians input per batch and k is the number of images randomly sampled from each pedestrian in a single modality;
In this embodiment, the pedestrian images of the training set may be read into memory using the Python programming language. The hardware of the experimental environment of the invention is an Intel(R) Core(TM) i9-10900K CPU @ 3.70 GHz with 32 GB of memory and an NVIDIA GeForce RTX 3090 GPU; the software platform uses Python 3.8.3 and CUDA 11.1, and the model structure is built with the PyTorch 1.7.0 deep learning framework. The number of pedestrians input to the network model per batch is set to p, and k images are randomly sampled from the images of each pedestrian in each modality, so the total number of images input to the network at one time is 2 x p x k.
A2: the method comprises the steps of performing size adjustment on pedestrian images in a cross-mode pedestrian re-identification data set, and adjusting the width and the height of the pedestrian images to 224 pixels;
since in the rotating self-supervising module, the pedestrian image of this rectangle-like shape changes after rotation, if the height and width of the pedestrian are still set to 256 pixels and 128 pixels, respectively, as is conventional, it will be readily recognized by the model. In the embodiment, the width and height of all the input pedestrian images are set to 224 pixels, the image height-width ratio is 1:1, the external features of the pedestrian images are hardly changed after rotation, the difficulty of training tasks can be effectively increased, the network model is promoted to pay more attention to and extract the detailed features of the pedestrians in the pedestrian images, and therefore the generalization capability and the robustness of the model are improved.
In addition, the training of the main network based on the dual self-supervision learning semi-supervision cross-mode pedestrian re-recognition also uses the pedestrian image with the width and the height set to 224 pixels, so that the shallow characteristics of the pedestrian image learned from the pedestrian image with the size of 224 x 224 by the rotary self-supervision module can be effectively applied to the supervised training of the main network, thereby promoting the faster convergence of the main network and improving the training precision.
A3: and C, randomly and horizontally overturning the pedestrian image with the size adjusted in the step A2 to enhance the generalization capability of the model and relieve the overfitting.
A4: filling 10 pixels into the pedestrian image subjected to random horizontal overturn in the step A3 through a torchvision.transformation.pad () function, wherein the pixel value of each filled pixel is 127;
a5: randomly cutting the pedestrian image filled in the step A4;
in this embodiment, random clipping refers to randomly selecting a rectangular region from a pedestrian image, so that the pedestrian image generates different degrees of occlusion, and correcting the dislocation in the pedestrian image through a spatial transformation network layer in an affine estimation branch. The part with large background is cut, and the missing part of the pedestrian image is filled, so that the phenomenon of network overfitting is reduced, the network generalization capability is improved, and meanwhile, no extra parameter learning or more memory consumption is needed.
A6: and (3) carrying out normalization processing on the pedestrian image subjected to random clipping in the step (A5) so that the preprocessed pedestrian image data is limited in a set range, thereby eliminating adverse effects caused by singular sample data. After the data normalization processing, the speed of gradient descent to solve the optimal solution can be increased, so that the precision is improved.
A7: and C, randomly erasing the pedestrian image subjected to normalization processing in the step A6 through a channel with the probability of 0.5, and finally obtaining an input image with supervision training.
B: performing data enhancement processing on pedestrian images in the cross-mode pedestrian re-identification data set to obtain pedestrian images after the data enhancement processing, wherein the pedestrian images after the data enhancement processing comprise a rotating self-supervision image based on context and a self-supervision image based on contrast;
the step B comprises the following specific steps:
b1: sequentially selecting one angle from the rotation angle set {0,90,180,270} for each pedestrian image with the size adjusted in the step A2 randomly, and correspondingly generating a pseudo tag for each pedestrian image with the rotation angle to obtain a context-based rotation self-supervision image;
in this embodiment, the obtained context-based rotation self-monitoring image is utilized, and in cooperation with the context-based rotation self-monitoring network in the step C, the background feature of the training image can be ignored focusing on the beneficial attribute represented by the pedestrian image feature, so that the meaningful feature in the semantics including the rotation related portion and the irrelevant portion can be effectively learned, and the rotation invariance is incorporated into the self-monitoring network learning framework. The context-based rotation self-monitoring network in step C learns a segmented representation comprising rotation-related and uncorrelated portions and trains the neural network by jointly predicting image rotations and distinguishing individual instances. In the invention, the rotation recognition is decoupled from the instance recognition, so that the rotation prediction can be improved by reducing the influence of the noise of the rotation label, and the recognition instance of the image rotation is not considered, so that the obtained feature has better generalization capability.
B2: c, randomly erasing the channel with the probability of 0.5 on the input image with the supervision training obtained in the step A7;
b3: and B2, using channel exchange for the pedestrian image after the random erasure of the channel in the step B2, and finally obtaining the self-supervision image based on comparison.
In this embodiment, color-independent images are generated uniformly by channel exchange (i.e., randomly swapping the color channels). The three-channel color visible-light images contain abundant pedestrian feature information, and the color information in them benefits visible-infrared matching, so robustness to color variation is continuously improved. Combining the random channel erasure of step B2 with the random cropping of step A5 further enriches the diversity of the data and yields stronger discriminability.
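The channel-exchange augmentation can be sketched as a random permutation of the color channels; the helper name and return convention are assumptions:

```python
import numpy as np

def channel_swap(img, rng):
    """Randomly permute the RGB channels of an (H, W, 3) image to
    build a colour-independent contrast view."""
    perm = rng.permutation(3)
    return img[:, :, perm], perm
```

Because only the channel order changes, the spatial structure of the pedestrian is preserved while the color statistics are scrambled.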
C: constructing a trunk network and a self-supervision training network of semi-supervision cross-mode pedestrian re-recognition based on double self-supervision learning; the self-supervision training network comprises a context-based rotating self-supervision network and a contrast learning-based self-supervision network; the backbone network, the context-based rotating self-monitoring network and the self-monitoring network based on contrast learning are arranged in parallel and share network weights;
the main network is used for performing supervised learning on the input image with the supervised training obtained in the step A, and obtaining final pedestrian image characteristics with robustness;
the backbone network sequentially comprises a first convolution layer, a first pooling layer, first to third residual layers, a first modal attention layer, a fourth residual layer, a second modal attention layer and a part alignment attention layer; the first convolution layer and the first to fourth residual layers perform feature extraction on the feature of the pedestrian image (including the shallow features and the deep features of the pedestrian image) subjected to dimension reduction layer by layer; the first convolution layer and the first to fourth residual layers are used for learning shallow layer features of the pedestrian image; the first modality attention layer is used for learning deep features of pedestrian images under two modalities, has the same structure as the second modality attention layer, and consists of two second convolution layers with convolution kernel size of 1, a ReLU activation function and a Sigmod activation function; the calculation formulas of the first modality attention layer and the second modality attention layer are as follows:
Figure BDA0004045364000000091
wherein Z represents the depth characteristics of the obtained pedestrian image,
Figure BDA0004045364000000092
representing the matrix of Z after instance normalization, m C For a channel mask, representing identity-related channels, m C The calculation formula of (2) is m C =σ(W 2 δ(W 1 g (Z)); g (·) represents the global average pooling layer, δ (·) represents the ReLU activation function, σ (·) represents the Sigmod activation function, W 1 and W2 Two fully connected layers in the modal attention layer are represented respectively, and are located after the ReLU activation function and the Sigmod activation function.
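A numpy sketch of one modal attention layer follows. The channel-mask formula m_C = σ(W_2 δ(W_1 g(Z))) comes from the text; the way the mask combines Z with its instance-normalised form is an assumption, since that formula appears only as an image in the filing, and the function names are hypothetical:

```python
import numpy as np

def instance_norm(z, eps=1e-5):
    # Per-channel instance normalisation over the spatial dimensions.
    mu = z.mean(axis=(1, 2), keepdims=True)
    var = z.var(axis=(1, 2), keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def modal_attention(z, w1, w2):
    """z: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r) play
    the roles of the two fully connected layers W_1 and W_2.
    ASSUMED combination: F = Z_hat + m_C * (Z - Z_hat)."""
    g = z.mean(axis=(1, 2))                                   # g(Z): GAP
    m = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ g, 0.0)))) # m_C
    z_hat = instance_norm(z)
    return z_hat + m[:, None, None] * (z - z_hat)
```

With m_C near 1 a channel keeps its original (identity-related) statistics; with m_C near 0 it is replaced by its instance-normalised, style-suppressed version.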
The part-alignment attention layer is used for exploiting the fine-grained discrepancy between the two modalities: it divides the global pedestrian image feature into six horizontal blocks, combines the global and local pedestrian image features into one feature vector, and processes it through an adaptive average pooling layer to obtain the final pedestrian image features;
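The part-alignment step described above (six horizontal blocks plus a pooled global feature) can be sketched as follows; the (C, H, W) feature layout and the concatenation order are assumptions for illustration only:

```python
import numpy as np

def part_align_features(feat, num_parts=6):
    """Sketch of part alignment: split a (C, H, W) feature map into
    `num_parts` horizontal stripes, average-pool each stripe, and
    concatenate the pooled global feature with the local stripe features."""
    C, H, W = feat.shape
    assert H % num_parts == 0, "H must be divisible by the number of parts"
    stripe_h = H // num_parts
    # local features: one C-dimensional vector per horizontal stripe
    locals_ = [feat[:, i * stripe_h:(i + 1) * stripe_h, :].mean(axis=(1, 2))
               for i in range(num_parts)]
    # global feature: average pooling over the whole map
    global_ = feat.mean(axis=(1, 2))
    return np.concatenate([global_] + locals_)   # shape (C * (num_parts + 1),)
```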
the context-based rotation self-supervision network is used for performing self-supervision learning on the context-based rotation self-supervision image obtained in the step B1, and finally obtaining a first probability matrix for rotation angle prediction; the context-based rotation self-supervision network sequentially comprises a third convolution layer, a second pooling layer, a third modal attention layer, a fifth residual layer, a fourth modal attention layer, a global average pooling layer, a BN layer and a first full-connection layer, wherein the third modal attention layer and the fourth modal attention layer have the same structure as the first modal attention layer;
In the invention, the context-based rotation self-supervision network applies a set of random geometric transformations to randomly rotate the input context-based rotation self-supervision image. Each randomly rotated self-supervision image corresponds to a pseudo label, and the network is trained to identify the rotation angle: if it cannot capture the deep pedestrian features in the rotated image, it cannot recover that image's rotation angle. The context-based rotation self-supervision network and the contrast-learning-based self-supervision network share the weights of the backbone network, while the rotation network also has an independent output head: the deep pedestrian features of the rotated image are passed through a global average pooling layer, a BN layer and a fully connected layer and mapped to a 4-dimensional vector for classification, finally yielding the first probability matrix for rotation-angle prediction; the 4 dimensions correspond to the labels of the four rotation angles.
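The rotation pretext task described above can be sketched in plain Python: each image is rotated by one of {0, 90, 180, 270} degrees, and the index of the chosen angle becomes the 4-way pseudo label the rotation branch must predict. The grid-of-rows image representation is an assumption for illustration:

```python
import random

ANGLES = (0, 90, 180, 270)  # pseudo-label i corresponds to ANGLES[i]

def rot90(img):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def make_rotation_sample(img, rng=random):
    """Randomly rotate an image and return (rotated_image, pseudo_label).
    The 4-way label is what the rotation self-supervision branch is
    trained to predict with its 4-dimensional output."""
    label = rng.randrange(4)
    out = img
    for _ in range(label):
        out = rot90(out)
    return out, label
```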
The contrast-learning-based self-supervision network is used for performing self-supervised learning on the contrast-based self-supervision image obtained in step B3, finally obtaining a second probability matrix for contrastive self-supervised learning; it is constructed by adding a second fully connected layer for color-image classification on top of the backbone network, the second fully connected layer being located after the part-alignment attention layer.
In the invention, the contrast-learning-based self-supervision network obtains images of the same person with different color effects from a given visible-light image through data enhancement, i.e. channel random erasing and channel exchange over the three channels R, G and B of the visible-light image; after pedestrian image features are extracted by the weight-sharing backbone network, the second fully connected layer produces the second probability matrix for contrastive self-supervised learning.
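The channel random erasing and channel exchange described above can be sketched as follows for an RGB image stored as rows of (R, G, B) tuples; the per-image sampling of the erased channel and of the permutation is an assumption, since the patent does not fix those details here:

```python
import random

def contrast_augment(image, rng=random):
    """Sketch of the contrast-based view: zero one randomly chosen
    channel (channel random erasing), then randomly permute the
    channel order (channel exchange), uniformly over the image.
    `image` is a list of rows of (R, G, B) tuples."""
    erased = rng.randrange(3)          # which channel to erase
    perm = [0, 1, 2]
    rng.shuffle(perm)                  # channel permutation

    def f(px):
        p = list(px)
        p[erased] = 0                  # channel random erasing
        return tuple(p[i] for i in perm)  # channel exchange

    return [[f(px) for px in row] for row in image]
```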
Since the supervised task imposes consistency constraints only on the image classifiers (i.e. the fully connected layers) of different modalities, the backbone network learns only the shallow pedestrian features shared across modalities and ignores those within the same modality. By learning shallow pedestrian features both across and within modalities through the contrast-learning-based self-supervision network, the invention can effectively learn the invariance between the supervised-training input image and its enhanced counterpart.
D: b, constructing a training set by utilizing the pedestrian image subjected to the enhancement processing in the step B, wherein the training set comprises a marked sample and an unmarked sample, the marked sample is used for obtaining final pedestrian image characteristics for supervised training through backbone network learning, and the unmarked sample is used for obtaining a first probability matrix for rotation angle prediction and a second probability matrix for contrast self-supervision learning through a context-based rotation self-supervision network and a contrast learning self-supervision network respectively;
E: using the final pedestrian image features for supervised training, the first probability matrix for rotation-angle prediction and the second probability matrix for contrastive self-supervised learning obtained in step D, performing the self-supervision-based pedestrian re-recognition task through the backbone network and the self-supervision training network of the dual-self-supervised semi-supervised cross-modal pedestrian re-recognition model;
in the step E, the final pedestrian image features of the supervised-training input image are updated by back propagation through a preset first cross-entropy loss function and a center loss function;
for the final pedestrian image features of the supervised-training input image, supervised training is performed through the first cross-entropy loss function and the center loss function. The cross-entropy loss L_id is computed as:

L_id = -(1/n) Σ_{i=1..n} y_i^v · log P(C(f_i^v)) - (1/m) Σ_{i=1..m} y_i^r · log P(C(f_i^r))

where n and m respectively denote the numbers of visible-light-modality and infrared-modality images in the current batch, f^v and f^r respectively denote the pedestrian image features of the visible-light and infrared modalities, y_i^v and y_i^r respectively denote the image labels corresponding to f^v and f^r, C(f^v) and C(f^r) respectively denote the probability matrices obtained by passing the two modalities' pedestrian image features through the classifier with parameters θ, and P(·) is the softmax (normalized exponential) function;
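A minimal NumPy sketch of this two-modality identity loss, assuming row-wise logits and integer labels (the classifier C(·) itself is omitted and its outputs are taken as given):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def identity_ce_loss(logits_v, labels_v, logits_r, labels_r):
    """Sketch of the supervised identity loss: mean cross-entropy over
    the visible-light batch plus mean cross-entropy over the infrared
    batch, each computed against its own labels."""
    pv = softmax(logits_v)   # P(C(f_v))
    pr = softmax(logits_r)   # P(C(f_r))
    lv = -np.log(pv[np.arange(len(labels_v)), labels_v]).mean()
    lr = -np.log(pr[np.arange(len(labels_r)), labels_r]).mean()
    return lv + lr
```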
the center loss function L_ct is computed as:

L_ct = Σ_i ||f_i - c_{y_i}||_2^2 + Σ_{k=1..T} Σ_{j=k+1..T} max(ρ - ||c_{y_k} - c_{y_j}||_2, 0)

where f_i denotes a pedestrian image feature, c_{y_i} denotes the mean of the features whose label in the current batch is y_i, c_{y_k} and c_{y_j} denote the mean features of current-batch labels y_k and y_j, T is the number of pedestrians in the current batch, and ρ is the minimum spacing enforced between all centers;
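A sketch of this center loss, with batch centers computed as per-label feature means and a hinge enforcing the minimum spacing ρ between centers; the exact weighting of the two terms is an assumption:

```python
import numpy as np

def center_loss(feats, labels, rho=1.0):
    """Sketch of the center loss: pull each feature toward its
    identity's batch center, and push distinct centers at least
    rho apart via a hinge on pairwise center distances."""
    ids = np.unique(labels)
    centers = {y: feats[labels == y].mean(axis=0) for y in ids}
    # intra-class term: squared distance of each feature to its center
    pull = sum(np.sum((f - centers[y]) ** 2) for f, y in zip(feats, labels))
    # inter-center term: hinge on pairwise distances between centers
    push = 0.0
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            d = np.linalg.norm(centers[ids[a]] - centers[ids[b]])
            push += max(0.0, rho - d)
    return pull + push
```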
the context-based rotation self-supervision network judges the rotation angle through a second cross-entropy loss function; finally, the backbone network outputs an average precision result, used for evaluating the accuracy of pedestrian re-identification;
in the invention, the context-based rotation self-supervision network computes the second cross-entropy loss L_rot over the first probability matrix it outputs, with the formula:

L_rot = -(1/R) Σ_{i=1..R} ỹ_i · log P(C(x̃_i))

where x̃_i denotes an image after random rotation, ỹ_i denotes the label generated by the random rotation angle of the image, and R denotes the total number of image samples in one batch;
for the second probability matrix output by the contrast-learning-based self-supervision network, the KL divergence is used as the consistency-constraint loss function:

L_kl = Σ_i p(x_i) · log( p(x_i) / q(x_i) )

where p(x_i) denotes the probability matrix obtained by the color-image classifier in supervised learning, and q(x_i) denotes the probability matrix obtained after the contrast-based self-supervision features pass through the classifier.
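A sketch of this KL consistency term over per-sample probability rows, with a small epsilon added for numerical safety (an implementation detail not specified in the patent):

```python
import math

def kl_consistency(p_rows, q_rows, eps=1e-12):
    """Sketch of the KL consistency constraint: sum over samples of
    KL(p || q), where p is the supervised classifier's probability
    row and q the contrast branch's row for the same sample."""
    total = 0.0
    for p, q in zip(p_rows, q_rows):
        total += sum(pi * math.log((pi + eps) / (qi + eps))
                     for pi, qi in zip(p, q))
    return total
```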
In the invention, the mean average precision and Rank-1 (first-match accuracy) obtained by the dual-self-supervised semi-supervised cross-modal pedestrian re-recognition method are improved by 5.9 percentage points (from 80.07% to 85.97%) and 8.27 percentage points (from 82.8% to 91.07%) respectively on the RegDB dataset, and by 1.35 percentage points (from 80.95% to 82.3%) and 1.96 percentage points (from 76.74% to 78.7%) respectively in the indoor scenario of the SYSU dataset. The invention not only successfully applies self-supervised learning to the pedestrian re-recognition field, but also demonstrates on multiple datasets that the method enhances recognition robustness and effectively improves the accuracy of pedestrian re-recognition.

Claims (10)

1. The semi-supervised cross-mode pedestrian re-identification method based on double self-supervised learning is characterized by comprising the following steps of:
a: constructing a cross-mode pedestrian re-identification data set, and preprocessing pedestrian images in the cross-mode pedestrian re-identification data set to obtain an input image with supervision training;
b: performing data enhancement processing on pedestrian images in the cross-mode pedestrian re-identification data set to obtain pedestrian images after the data enhancement processing, wherein the pedestrian images after the data enhancement processing comprise a rotating self-supervision image based on context and a self-supervision image based on contrast;
c: constructing a backbone network and a self-supervision training network for semi-supervised cross-modal pedestrian re-recognition based on dual self-supervised learning; the self-supervision training network comprises a context-based rotation self-supervision network and a contrast-learning-based self-supervision network; the backbone network, the context-based rotation self-supervision network and the contrast-learning-based self-supervision network are arranged in parallel and share network weights;
the main network is used for performing supervised learning on the input image with the supervised training to acquire final pedestrian image characteristics; the context-based rotation self-supervision network is used for performing self-supervision learning on the context-based rotation self-supervision image to obtain a first probability matrix for rotation angle prediction; the self-supervision network is used for carrying out self-supervision learning on the self-supervision image based on contrast learning to obtain a second probability matrix for contrast self-supervision learning;
d: b, constructing a training set by utilizing the pedestrian image subjected to the enhancement processing in the step B, wherein the training set comprises a marked sample and an unmarked sample, the marked sample is used for obtaining final pedestrian image characteristics for supervised training through backbone network learning, and the unmarked sample is used for obtaining a first probability matrix for rotation angle prediction and a second probability matrix for contrast self-supervision learning through a context-based rotation self-supervision network and a contrast learning self-supervision network respectively;
e: and D, performing a pedestrian re-recognition task based on self-supervision through a main network and a self-supervision training network based on semi-supervision cross-mode pedestrian re-recognition of double self-supervision learning by using the final pedestrian image characteristics for supervised training, a first probability matrix for rotation angle prediction and a second probability matrix for comparison self-supervision learning, and outputting a final recognition result.
2. The semi-supervised cross-modal pedestrian re-recognition method based on dual self-supervised learning as set forth in claim 1, wherein the step a includes the following specific steps:
a1: constructing a cross-mode pedestrian re-recognition data set, acquiring pedestrian images in a training set in the cross-mode pedestrian re-recognition data set, and setting the total number of images input by a network model of a semi-supervised cross-mode pedestrian re-recognition method based on double self-supervision learning;
a2: the method comprises the steps of performing size adjustment on pedestrian images in a cross-mode pedestrian re-identification data set, and adjusting the width and the height of the pedestrian images to be the same size;
a3: c, randomly and horizontally overturning the pedestrian image with the size adjusted in the step A2;
a4: c, filling pixels in the pedestrian image subjected to random horizontal overturn in the step A3;
a5: randomly cutting the pedestrian image filled in the step A4;
a6: carrying out normalization processing on the pedestrian image subjected to random clipping in the step A5;
a7: and C, carrying out channel random erasure on the pedestrian image subjected to normalization processing in the step A6 to obtain an input image with supervision training.
3. The semi-supervised cross-modal pedestrian re-recognition method based on dual self-supervised learning as set forth in claim 2, wherein the step B includes the specific steps of:
b1: sequentially selecting one angle from the rotation angle set {0, 90, 180, 270} for each resized pedestrian image, and generating a pseudo label for each rotated pedestrian image, obtaining a context-based rotation self-supervision image;
b2: carrying out channel random erasing on the obtained input image with supervision training;
b3: and using channel exchange for the pedestrian image after the random erasure of the channel is completed, and obtaining a self-supervision image based on contrast.
4. The dual self-supervised learning-based semi-supervised cross-modal pedestrian re-recognition method as set forth in claim 1, wherein: in the step C, the backbone network sequentially comprises a first convolution layer, a first pooling layer, first to third residual layers, a first modal attention layer, a fourth residual layer, a second modal attention layer and a part-alignment attention layer; the first convolution layer and the first to fourth residual layers perform layer-by-layer feature extraction on the dimension-reduced pedestrian image and learn its shallow features; the first modal attention layer is used for learning deep features of pedestrian images in the two modalities; the part-alignment attention layer is used for exploiting the small discrepancy between the visible-light and infrared modalities to obtain the final pedestrian image features.
5. The semi-supervised cross-modal pedestrian re-recognition method based on dual self-supervised learning as set forth in claim 4, wherein: the first modal attention layer and the second modal attention layer have the same structure, each consisting of two second convolution layers with convolution kernel size 1, a ReLU activation function and a Sigmoid activation function; the first and second modal attention layers are computed as follows:

Ẑ = m_C ⊙ Z + (1 − m_C) ⊙ Z̃

where Z denotes the deep features of the pedestrian image, Z̃ denotes the matrix obtained from Z after instance normalization, ⊙ denotes element-wise multiplication, and m_C is a channel mask marking identity-related channels, computed as m_C = σ(W_2 δ(W_1 g(Z))); g(·) denotes the global average pooling layer, δ(·) the ReLU activation function, σ(·) the Sigmoid activation function, and W_1 and W_2 the two fully connected layers in the modal attention layer, followed by the ReLU and Sigmoid activations respectively.
6. The dual self-supervised learning-based semi-supervised cross-modal pedestrian re-recognition method as set forth in claim 1, wherein: the context-based rotation self-supervision network sequentially comprises a third convolution layer, a second pooling layer, a third modal attention layer, a fifth residual error layer, a fourth modal attention layer, a global average pooling layer, a BN layer and a first full-connection layer, wherein the third modal attention layer and the fourth modal attention layer have the same structure as the first modal attention layer.
7. The dual self-supervised learning-based semi-supervised cross-modal pedestrian re-recognition method as set forth in claim 1, wherein: the contrast-learning-based self-supervision network is obtained by adding a second fully connected layer on top of the backbone network, the second fully connected layer being located after the part-alignment attention layer.
8. The dual self-supervised learning-based semi-supervised cross-modal pedestrian re-recognition method as set forth in claim 1, wherein: in the step E, the final pedestrian image characteristics of the input image with the supervision training are updated by using the back propagation through a set first cross entropy loss function and a center loss function;
the cross-entropy loss L_id is computed as:

L_id = -(1/n) Σ_{i=1..n} y_i^v · log P(C(f_i^v)) - (1/m) Σ_{i=1..m} y_i^r · log P(C(f_i^r))

where n and m respectively denote the numbers of visible-light-modality and infrared-modality images in the current batch, f^v and f^r respectively denote the pedestrian image features of the visible-light and infrared modalities, y_i^v and y_i^r respectively denote the image labels corresponding to f^v and f^r, C(f^v) and C(f^r) respectively denote the probability matrices obtained by passing the two modalities' pedestrian image features through the classifier with parameters θ, and P(·) is the softmax function;

the center loss function L_ct is computed as:

L_ct = Σ_i ||f_i - c_{y_i}||_2^2 + Σ_{k=1..T} Σ_{j=k+1..T} max(ρ - ||c_{y_k} - c_{y_j}||_2, 0)

where f_i denotes a pedestrian image feature, c_{y_i} denotes the mean of the features whose current-batch label is y_i, c_{y_k} and c_{y_j} denote the mean features of current-batch labels y_k and y_j, T is the number of pedestrians in the current batch, and ρ is the minimum spacing between all centers.
9. The dual self-supervised learning-based semi-supervised cross-modal pedestrian re-recognition method as set forth in claim 1, wherein: in the step E, the context-based rotation self-supervision network judges the rotation angle through a second cross-entropy loss function, and the backbone network finally outputs an average precision result used for evaluating the accuracy of pedestrian re-identification;
the second cross-entropy loss function L_rot is computed as:

L_rot = -(1/R) Σ_{i=1..R} ỹ_i · log P(C(x̃_i))

where x̃_i denotes an image after random rotation, ỹ_i is the label generated by the random rotation angle of the image, and R denotes the total number of image samples in one batch.
10. The dual self-supervised learning-based semi-supervised cross-modal pedestrian re-recognition method as set forth in claim 1, wherein: in the step E, based on a second probability matrix output by the self-supervision network of the comparison learning, the KL divergence is used as a consistency constraint loss function;
L_kl = Σ_i p(x_i) · log( p(x_i) / q(x_i) )

where p(x_i) denotes the probability matrix obtained by the color-image classifier in supervised learning, and q(x_i) denotes the probability matrix obtained after the contrast-based self-supervision features pass through the classifier.
CN202310027835.5A 2023-01-09 2023-01-09 Semi-supervised cross-mode pedestrian re-recognition method based on dual self-supervised learning Pending CN116052212A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310027835.5A CN116052212A (en) 2023-01-09 2023-01-09 Semi-supervised cross-mode pedestrian re-recognition method based on dual self-supervised learning


Publications (1)

Publication Number Publication Date
CN116052212A true CN116052212A (en) 2023-05-02

Family

ID=86115978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310027835.5A Pending CN116052212A (en) 2023-01-09 2023-01-09 Semi-supervised cross-mode pedestrian re-recognition method based on dual self-supervised learning

Country Status (1)

Country Link
CN (1) CN116052212A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116543268A (en) * 2023-07-04 2023-08-04 西南石油大学 Channel enhancement joint transformation-based countermeasure sample generation method and terminal
CN116612439A (en) * 2023-07-20 2023-08-18 华侨大学 Balancing method for modal domain adaptability and feature authentication and pedestrian re-identification method
CN116824695A (en) * 2023-06-07 2023-09-29 南通大学 Pedestrian re-identification non-local defense method based on feature denoising
CN117351518A (en) * 2023-09-26 2024-01-05 武汉大学 Method and system for identifying unsupervised cross-modal pedestrian based on level difference



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination