CN113887382B - RGB-D-based cross-modal pedestrian re-identification method, storage medium and device

Info

Publication number: CN113887382B
Application number: CN202111148969.XA
Authority: CN (China)
Prior art keywords: local, RGB, feature, image, depth image
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN113887382A
Inventors: 吴晶晶, 蒋建国, 齐美彬, 尤小泉, 庄硕, 李小红
Current Assignee: Hefei University of Technology
Original Assignee: Hefei University of Technology
Application filed by Hefei University of Technology; priority to CN202111148969.XA
Publication of CN113887382A; application granted; publication of CN113887382B

Classifications

  • G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches (G PHYSICS; G06F ELECTRIC DIGITAL DATA PROCESSING; G06F18/00 Pattern recognition; G06F18/20 Analysing; G06F18/24 Classification techniques)
  • Y02T10/40 — Engine management systems (Y02T Climate change mitigation technologies related to transportation; Y02T10/00 Road transport of goods or passengers; Y02T10/10 Internal combustion engine [ICE] based vehicles)


Abstract

The invention discloses an RGB-D-based cross-modal pedestrian re-identification method. Global features of the depth image and the RGB image are extracted by the global branches of the depth image and the RGB image respectively, and local features of the depth image and the RGB image are extracted by their local branches. The cross-modal pedestrian re-identification network is trained by fusing the global and local loss functions, and the trained network is then used for pedestrian re-identification. By fully exploiting the relationship between the depth-image and RGB-image modalities, the method improves recognition accuracy.

Description

RGB-D-based cross-modal pedestrian re-identification method, storage medium and device
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a cross-modal pedestrian re-identification method based on a depth image and an RGB image.
Background
Given a depth image (D) of a pedestrian of interest, RGB-D cross-modal pedestrian re-identification aims to retrieve the target pedestrian from an RGB pedestrian candidate library. Each pixel value in the depth map describes the depth (distance) of the corresponding scene point. Images of this modality are not susceptible to illumination changes, so a depth image can be captured in place of an RGB image when lighting is too poor for RGB imaging. Compared with traditional pedestrian re-identification based on RGB images, RGB-D cross-modal pedestrian re-identification can therefore identify pedestrians under poor lighting conditions such as at night. The task thus better matches practical requirements and can be applied to a wider range of real scenes.
RGB-D cross-modal pedestrian re-identification requires matching pedestrians between pedestrian images of different modalities. In addition to the challenges of RGB pedestrian re-identification, such as changes in viewing angle, cluttered backgrounds and occlusion of pedestrians, the large gap between modalities makes cross-modal recognition even more difficult. Document 1 (Peng Zhang, Jinsong Xu, Qiang Wu, Yan Huang, and Jian Zhang. 2019. Top-push constrained modality-adaptive dictionary learning for cross-modality person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 12 (2019), 4554-4566) and document 2 (Jiaxuan Zhuo, Junyong Zhu, Jianhuang Lai, and Xiaohua Xie. 2017. Person re-identification on heterogeneous camera network. In CCF Chinese Conference on Computer Vision. Springer, 280-291) recognize pedestrians by extracting manually designed features and lack abstract semantic expression. Document 3 (Frank M. Hafner, Amran Bhuiyan, Julian F. P. Kooij, and Eric Granger. 2019. RGB-depth cross-modal person re-identification. In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 1-8) first trains a single-modality pedestrian re-identification network on depth maps. It then trains an RGB image recognition network while keeping the parameters of the depth-map network fixed, and uses distillation learning to constrain the features of each RGB image to be as similar as possible to the features of its corresponding depth map, thereby reducing the gap between modalities. Although this method constrains the features of one-to-one corresponding cross-modal image pairs to be as similar as possible through a two-stage learning process, which reduces the difference between the two modalities and improves cross-modal recognition accuracy, it ignores the rich relationships contained in all other cross-modal image pairs. Moreover, the two rounds of training make the training process of the whole network more complicated. Furthermore, it uses only a simple global feature extraction network to obtain pedestrian features, ignoring the use of spatial context information. Spatial relationships contain abundant appearance details and pedestrian body-shape information, which are important for improving the representation capability of the image, especially for the depth image, which provides only single-channel information.
Disclosure of Invention
The invention aims to: the invention provides a cross-modal pedestrian re-identification method based on a depth image and an RGB image, with the goal of capturing the relationship between the two modalities of the depth image and the RGB image and improving recognition accuracy.
The technical scheme is as follows: the invention provides an RGB-D-based cross-modal pedestrian re-identification method, which comprises a training phase and a recognition phase, wherein the training phase comprises the following steps:
s1, constructing a cross-mode pedestrian re-identification network, wherein the pedestrian re-identification network comprises a global sub-network 100 and a local sub-network 200, and the global sub-network 100 comprises a depth image global branch 110, an RGB image global branch 120 and an identification branch 130; the local subnetwork 200 comprises a depth image local branch 210, an RGB image local branch 220 and a local Triplet loss function calculation module;
the depth image global branch 110 is used for extracting deep features G of the depth image D The method comprises the steps of carrying out a first treatment on the surface of the Comprises a first shallow feature extraction module which is connected in sequence and is used for extracting shallow features FS of an input depth image D The method comprises the steps of carrying out a first treatment on the surface of the A first deep feature extraction module for extracting the first deep feature from FS D Deep feature G of extracted depth image D The method comprises the steps of carrying out a first treatment on the surface of the A first average pooling module for using the average pooling method to pool the deep features G D Global feature FP of a depth image obtained from a camera D The method comprises the steps of carrying out a first treatment on the surface of the A first batch of normalization modules for global features FP D Batch standardization processing is carried out, and category characteristics FB of depth images are obtained through the full connection layer D The method comprises the steps of carrying out a first treatment on the surface of the A first softmax function calculation module for calculating category characteristics FB according to the real category labels of the training samples D Is a softmax loss function L of (2) gs1
The global RGB image branch 120 is used for extracting deep features G of RGB images V The method comprises the steps of carrying out a first treatment on the surface of the Comprises a second shallow feature extraction module which is connected in sequence and is used for extracting shallow features FS of an input RGB image V The method comprises the steps of carrying out a first treatment on the surface of the A second deep feature extraction module for extracting the feature from FS V Deep feature G of RGB image V The method comprises the steps of carrying out a first treatment on the surface of the A second average pooling module for using the average pooling method to pool the deep features G V Global feature FP of an RGB image V The method comprises the steps of carrying out a first treatment on the surface of the A second batch of normalization modules for global features FP V Batch standardization processing is carried out, and category characteristics FB of RGB images are obtained through the full connection layer V The method comprises the steps of carrying out a first treatment on the surface of the A second softmax function calculation module for calculating the category feature FB according to the real category label of the training sample V Is a softmax loss function L of (2) gs2
The identification branch 130 is configured to identify deep features G according to the depth image D And deep features G of RGB images V Obtaining the similarity C of the depth image and the RGB image d,v According to the similarity C d,v Obtaining a recognition result according to the global characteristic FP of the depth image D And global feature FP of RGB images V Computing global Triplet loss function L gt
The local depth image branch 210 is used for extracting deep features G of the depth image D Contextual local feature FG of (C) D The method comprises the steps of carrying out a first treatment on the surface of the Comprises a first local feature extraction module which is connected in sequence and is used for extracting deep features G of a depth image D Local feature FL of (2) D The method comprises the steps of carrying out a first treatment on the surface of the A first adaptive pooling module for obtaining local feature FL D Attention feature FA of (a) D The method comprises the steps of carrying out a first treatment on the surface of the A first local loss function calculation module for calculating a local softmax loss function L of the depth image ls1
The RGB image local branch 220 is used for extracting deep features G of the RGB image V Context local feature FB of (1) V The method comprises the steps of carrying out a first treatment on the surface of the Comprises a second local feature extraction module which is connected in sequence and is used for extracting deep features G of RGB images V Local feature FL of (2) V The method comprises the steps of carrying out a first treatment on the surface of the A second adaptive pooling module for obtaining the local feature FL V Attention feature FA of (a) V The method comprises the steps of carrying out a first treatment on the surface of the A second local loss function calculation module for calculating a local softmax loss function L of the RGB image ls2
The local Triplet loss function calculation module is used for calculating local attention characteristic FA according to the depth image D And a local attention feature FA of an RGB image V Calculating a local Triplet loss function L lt
S2, training the network constructed in step S1 with a training set, wherein the samples in the training set are depth images labeled with categories and the corresponding RGB images. The training loss function fuses the global loss functions L_gs1, L_gs2 and L_gt with the local loss functions L_ls1, L_ls2 and L_lt, where the superscript τ denotes the number of epochs currently trained.
The identification phase comprises:
s3, respectively inputting the depth image and the RGB image in the cross-modal image pair to be identified into the depth image global branch 110 and the RGB image global branch 120, and identifying the branch 130 to obtain the similarity C of the depth features of the depth image and the RGB image d,v And according to the similarity C d,v And obtaining a recognition result.
In another aspect, the invention discloses a computer-readable storage medium having stored thereon computer instructions which, when executed, perform the above cross-modal pedestrian re-identification method.
In yet another aspect, the invention discloses a cross-modal pedestrian recognition device comprising a processor and a storage medium, the storage medium being the computer-readable storage medium described above; the processor loads and executes the instructions and data in the storage medium to implement the above cross-modal pedestrian re-identification method.
The beneficial effects are as follows: compared with the prior art, the cross-modal-constraint-based RGB-D cross-modal pedestrian re-identification method disclosed by the invention obtains the recognition result through the relationship between the deep features of the depth image and the RGB image, and trains the whole recognition network by fusing the global and local error functions of the two modalities, thereby fully exploiting the relationships across modalities and improving recognition accuracy.
Drawings
FIG. 1 is a structural diagram of the cross-modal pedestrian re-identification network in embodiment 1;
FIG. 2 is a structural diagram of the global sub-network in embodiment 2;
FIG. 3 is a structural diagram of the cross-modal relationship network in embodiment 2;
FIG. 4 is a structural diagram of the local sub-network in embodiment 3;
FIG. 5 is a structural diagram of the cross-modal pedestrian re-identification network in embodiment 3;
FIG. 6 is a schematic diagram of the cross-modal pedestrian recognition device of the present disclosure.
Detailed Description
The invention is further elucidated below in connection with the drawings and the detailed description.
Example 1:
This embodiment discloses an RGB-D cross-modal pedestrian re-identification method based on cross-modal constraints, which comprises a training phase and a recognition phase, wherein the training phase comprises the following steps:
s1, constructing a cross-mode pedestrian re-recognition network, wherein the pedestrian re-recognition network comprises a global sub-network 100 and a local sub-network 200, and the global sub-network 100 comprises a depth image global branch 110, an RGB image global branch 120 and a recognition branch 130; the local subnetwork 200 includes a depth image local branch 210, an RGB image local branch 220, a local Triplet loss function calculation module.
The depth image global branch 110 is used for extracting deep features G_D of the depth image. It comprises, connected in sequence: a first shallow feature extraction module for extracting shallow features FS_D of the input depth image; a first deep feature extraction module for extracting the deep features G_D of the depth image from FS_D; a first average pooling module for pooling the deep features G_D by average pooling to obtain the global feature FP_D of the depth image; a first batch normalization module for performing batch normalization (Batch Normalization) on the global feature FP_D and obtaining the category feature FB_D of the depth image through a fully connected layer; and a first softmax function calculation module for calculating the softmax loss function L_gs1 of the category feature FB_D according to the real category labels of the training samples.
The RGB image global branch 120 is used for extracting deep features G_V of the RGB image. It comprises, connected in sequence: a second shallow feature extraction module for extracting shallow features FS_V of the input RGB image; a second deep feature extraction module for extracting the deep features G_V of the RGB image from FS_V; a second average pooling module for pooling the deep features G_V by average pooling to obtain the global feature FP_V of the RGB image; a second batch normalization module for performing batch normalization (Batch Normalization) on the global feature FP_V and obtaining the category feature FB_V of the RGB image through a fully connected layer; and a second softmax function calculation module for calculating the softmax loss function L_gs2 of the category feature FB_V according to the real category labels of the training samples.
In this embodiment, the first shallow feature extraction module and the second shallow feature extraction module are both composed of the initial layer and block 1 of Resnet-50. For details of Resnet-50, see: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778. Shallow feature extraction is not shared between the depth image and RGB image modalities; its aim is to extract modality-specific features from the images of the two modalities respectively.
The first deep feature extraction module and the second deep feature extraction module share weights; their aim is to extract deep features common to the two modalities, thereby reducing the difference between them. In this embodiment, each is composed of blocks 2-4 of the Resnet-50 network.
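To make this backbone layout concrete, the following PyTorch-style sketch shows one way such a two-stream feature extractor could be organized: modality-specific shallow layers (the ResNet-50 stem plus block 1) and weight-shared deep layers (blocks 2-4). It is only an illustrative sketch under the assumptions noted in the comments (for example, the class name DualStreamBackbone and the replication of the single-channel depth map to three channels), not the patented implementation.

```python
# Illustrative sketch: two-stream ResNet-50 backbone with modality-specific shallow
# layers and a shared deep stream. Names such as DualStreamBackbone are invented here.
import torch.nn as nn
from torchvision.models import resnet50

def split_resnet50():
    """Split ResNet-50 into a shallow part (stem + block 1) and a deep part (blocks 2-4)."""
    net = resnet50(weights=None)
    shallow = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool, net.layer1)
    deep = nn.Sequential(net.layer2, net.layer3, net.layer4)
    return shallow, deep

class DualStreamBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Shallow feature extraction is NOT shared: one stream per modality.
        self.shallow_depth, self.deep_shared = split_resnet50()  # deep blocks are shared
        self.shallow_rgb, _ = split_resnet50()

    def forward(self, x_depth, x_rgb):
        # x_depth is assumed to be the single-channel depth map replicated to 3 channels.
        fs_d = self.shallow_depth(x_depth)   # FS_D
        fs_v = self.shallow_rgb(x_rgb)       # FS_V
        g_d = self.deep_shared(fs_d)         # G_D, shape (B, 2048, h, w)
        g_v = self.deep_shared(fs_v)         # G_V
        return g_d, g_v
```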
The global feature FP_D obtained by the first average pooling module is processed by the first batch normalization module and the fully connected layer to obtain the category feature FB_D, with FB_D ∈ R^u, where u is the number of pedestrian categories.
FB_D is input into the first softmax loss function calculation module to calculate the softmax loss function L_gs1. The specific steps are as follows:
First, the probability that the training sample belongs to each class k, k ∈ {1, 2, …, u}, is calculated:
p(k) = exp(FB_D,k) / Σ_{j=1}^{u} exp(FB_D,j)
where FB_D,k is the value of FB_D on the k-th channel. The loss function L_gs1 is then calculated as:
L_gs1 = −Σ_{k=1}^{u} q(k) log p(k)
where, if g is the true class of the training sample, q(k) = 1 when k = g and q(k) = 0 otherwise. Minimizing L_gs1 maximizes the predicted probability of the correct category.
Likewise, the global feature FP_V obtained by the second average pooling module is processed by the second batch normalization module and the fully connected layer to obtain the category feature FB_V, with FB_V ∈ R^u. FB_V is input into the second softmax loss function calculation module to calculate the softmax loss function L_gs2.
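A minimal sketch of one such global classification head (average pooling, batch normalization, fully connected layer, softmax loss) might look as follows; the same structure is used for both modalities, and the class name GlobalHead is invented for this example.

```python
# Illustrative sketch of a global classification head: average pooling, batch
# normalization, a fully connected layer, and the softmax (cross-entropy) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalHead(nn.Module):
    def __init__(self, in_dim: int, num_ids: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # average pooling -> global feature FP
        self.bn = nn.BatchNorm1d(in_dim)      # batch normalization module
        self.fc = nn.Linear(in_dim, num_ids)  # fully connected layer -> category feature FB in R^u

    def forward(self, deep_feat: torch.Tensor):
        fp = self.pool(deep_feat).flatten(1)  # FP: (B, C)
        fb = self.fc(self.bn(fp))             # FB: (B, u)
        return fp, fb

def softmax_loss(fb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # L_gs = -sum_k q(k) log p(k), with q the one-hot ground-truth distribution.
    return F.cross_entropy(fb, labels)
```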
The identification branch 130 is used to obtain the similarity C_d,v of the depth image and the RGB image according to the deep features G_D of the depth image and the deep features G_V of the RGB image, to obtain the recognition result according to the similarity C_d,v, and to compute the global Triplet loss function L_gt according to the global feature FP_D of the depth image and the global feature FP_V of the RGB image.
In this embodiment, the similarity C_d,v is calculated by a similarity calculation module; specifically, the Euclidean distance between the deep features G_D of the depth image and the deep features G_V of the RGB image is used:
C_d,v = sqrt( Σ_{n=1}^{N_D} (G_D,n − G_V,n)² )
where G_D and G_V both have dimension N_D, and G_D,n and G_V,n are the n-th elements of G_D and G_V respectively.
The more similar two vectors are, the smaller their Euclidean distance. Therefore, the smaller the similarity value C_d,v, the more likely it is that the input depth image and RGB image show the same pedestrian. The recognition result can be expressed in either of the following two ways:
mode one:
if C d,v <R th The input depth image and the RGB image are of the same kind, namely the same pedestrian; the other cases consider the input depth image and RGB image as different classes. Wherein R is th The first recognition judgment threshold value can be obtained through multiple experimental statistics.
Mode two:
the depth image of the pedestrian to be identified and the RGB image in the RGB pedestrian candidate library form a plurality of depth images and RGB image pairs, each image pair is input into a cross-mode pedestrian re-identification network respectively, and the similarity (depth characteristic G of the depth image in the embodiment) of each image pair is obtained D And deep features G of RGB images V Euclidean distance) of the image pair with the smallest similarity value, the input depth image and the RGB image are the most likely to be of the same kind. In this embodiment, the identification branch 130 uses a global Triplet loss function calculation module to calculate the global feature FP according to the depth image D And global feature FP of RGB images V Calculating a Triplet loss function L gt
The global feature FP_D obtained by the first average pooling module and the global feature FP_V obtained by the second average pooling module serve as the inputs of the global Triplet loss function calculation module. Following Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, the Triplet loss function L_gt is calculated as: L_gt = max(d_p − d_n + β, 0),
where d_p and d_n are the distances between the features of the positive pair and of the negative pair respectively, and β is the margin of the Triplet loss function; β = 0.3 in this embodiment.
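Both the Euclidean similarity and the global Triplet loss admit a direct implementation. The sketch below assumes the features have already been flattened into vectors and uses the margin β = 0.3 from this embodiment; the simple anchor/positive/negative interface is an assumption, since the patent does not specify how triplets are mined.

```python
# Illustrative sketch: Euclidean similarity between depth and RGB features, gallery
# ranking (mode two), and the global Triplet loss L_gt = max(d_p - d_n + beta, 0).
import torch
import torch.nn.functional as F

def euclidean_similarity(g_d: torch.Tensor, g_v: torch.Tensor) -> torch.Tensor:
    """C_{d,v}: a smaller value means the two images more likely show the same pedestrian."""
    return torch.norm(g_d - g_v, p=2, dim=-1)

def rank_gallery(query_depth_feat: torch.Tensor, gallery_rgb_feats: torch.Tensor) -> torch.Tensor:
    """Mode two: rank RGB candidates by ascending Euclidean distance to the depth query."""
    dists = torch.norm(gallery_rgb_feats - query_depth_feat, p=2, dim=1)
    return torch.argsort(dists)  # index of the best match comes first

def global_triplet_loss(fp_anchor, fp_pos, fp_neg, beta: float = 0.3) -> torch.Tensor:
    """L_gt averaged over the batch."""
    d_p = torch.norm(fp_anchor - fp_pos, p=2, dim=1)  # positive-pair distance
    d_n = torch.norm(fp_anchor - fp_neg, p=2, dim=1)  # negative-pair distance
    return F.relu(d_p - d_n + beta).mean()
```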
The depth image local branch 210 is used for extracting the context-fused local features FG_D of the deep features G_D of the depth image. It comprises, connected in sequence: a first local feature extraction module for extracting local features FL_D of the deep features G_D of the depth image; a first adaptive pooling module for obtaining the attention features FA_D of the local features FL_D; and a first local loss function calculation module for calculating the local softmax loss function L_ls1 of the depth image.
The RGB image local branch 220 is used for extracting the context-fused local features FG_V of the deep features G_V of the RGB image. It comprises, connected in sequence: a second local feature extraction module for extracting local features FL_V of the deep features G_V of the RGB image; a second adaptive pooling module for obtaining the attention features FA_V of the local features FL_V; and a second local loss function calculation module for calculating the local softmax loss function L_ls2 of the RGB image.
The local Triplet loss function calculation module is used for calculating the local Triplet loss function L_lt according to the local attention features FA_D of the depth image and the local attention features FA_V of the RGB image.
In this embodiment, the first local feature extraction module and the second local feature extraction module each consist of three convolutional layers with kernel size 3 and stride 1 connected in sequence, which further encode the deep features G_D and G_V to obtain FL_D and FL_V respectively.
The processing of FL_D by the first adaptive pooling module is specifically as follows:
FL_D is first split evenly into N_1 blocks to obtain N_1 local features FL_D,i, i = 1, 2, …, N_1; in this embodiment N_1 = 4. To obtain more discriminative local features, this embodiment uses adaptive pooling (attentive pooling) to downsample FL_D,i. Specifically, a spatial distribution is adaptively learned for each local feature FL_D,i; after the distribution is normalized with the softmax function, it is used as the downsampling weight to downsample the local feature FL_D,i. In this way, discriminative spatial parts are automatically selected from the local features to characterize the local region, so that the share of discriminative regions in the local space is reinforced. The pooling outputs a corresponding local attention feature block FA_D,i for each FL_D,i, yielding N_1 local attention feature blocks in total.
The second adaptive pooling module performs the same processing on FL_V to obtain N_2 local attention feature blocks. In this embodiment, both N_1 and N_2 are 4.
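One possible reading of this attentive pooling step is sketched below: the local feature map is split into N blocks along the height, a spatial attention map is predicted per block, normalized with softmax, and used as the downsampling weight. The 1x1-convolution attention predictor is an assumption; the patent only states that a spatial distribution is adaptively learned.

```python
# Illustrative sketch of attentive pooling: split FL into N local blocks, learn a
# spatial weight per block, normalize it with softmax, and pool with those weights.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, channels: int, num_blocks: int = 4):
        super().__init__()
        self.num_blocks = num_blocks
        # A shared 1x1 convolution predicts the spatial distribution (assumption).
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, fl: torch.Tensor):
        # fl: local feature map FL of shape (B, C, H, W); split evenly along the height.
        blocks = torch.chunk(fl, self.num_blocks, dim=2)
        pooled = []
        for blk in blocks:                                      # blk corresponds to FL_i
            w = self.attn(blk)                                  # spatial distribution, (B, 1, h, W)
            a = torch.softmax(w.flatten(2), dim=-1).view_as(w)  # normalize with softmax
            fa = (blk * a).sum(dim=(2, 3))                      # weighted downsampling -> FA_i, (B, C)
            pooled.append(fa)
        return pooled  # N local attention feature blocks
```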
The N_1 local attention feature blocks obtained by the first adaptive pooling module are concatenated and input into the first local loss function calculation module to calculate the local Triplet loss function L_lt1 and the local softmax loss function L_ls1 of the depth image. Likewise, the N_2 local attention feature blocks obtained by the second adaptive pooling module are concatenated and input into the second local loss function calculation module to calculate the local Triplet loss function L_lt2 and the local softmax loss function L_ls2 of the RGB image.
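A short sketch of such a local loss head is given below: the local attention feature blocks are concatenated into one local descriptor, which is classified for the local softmax loss and also returned for the local Triplet loss. The BatchNorm-plus-Linear classifier layout is an assumption made for this example.

```python
# Illustrative sketch of the local loss computation over concatenated attention blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalLossHead(nn.Module):
    def __init__(self, dim_per_block: int, num_blocks: int, num_ids: int):
        super().__init__()
        feat_dim = dim_per_block * num_blocks
        self.bn = nn.BatchNorm1d(feat_dim)      # classifier head layout is an assumption
        self.fc = nn.Linear(feat_dim, num_ids)

    def forward(self, blocks, labels):
        # blocks: list of N tensors FA_i, each of shape (B, C)
        feat = torch.cat(blocks, dim=1)          # concatenated local descriptor
        l_ls = F.cross_entropy(self.fc(self.bn(feat)), labels)  # local softmax loss
        return feat, l_ls                        # `feat` also feeds the local Triplet loss
```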
S2, training the network constructed in step S1 with a training set, wherein the samples in the training set are depth images labeled with categories and the corresponding RGB images. The training loss function fuses the global loss functions L_gs1, L_gs2 and L_gt with the local loss functions L_ls1, L_ls2 and L_lt, where the superscript τ denotes the number of epochs currently trained. The weight of the contrast loss L_C is inversely proportional to the global loss function values; this allows the cross-modal relational modeling of the global features to gradually increase its proportion in the overall training once the global features have become more discriminative.
The trained cross-modal pedestrian re-identification network is then used to recognize the cross-modal image pair to be identified:
The depth image and the RGB image in the cross-modal image pair to be identified are input into the depth image global branch 110 and the RGB image global branch 120 respectively, and the identification branch 130 obtains the similarity C_d,v of the depth image and the RGB image and obtains the recognition result according to C_d,v.
Example 2:
The difference between this embodiment and embodiment 1 is that the identification branch 130 in the global sub-network further comprises a cross-modal relationship network, a fully connected layer, a similarity correction module and a contrast loss function calculation module; FIG. 2 is a schematic diagram of the global sub-network 100 in this embodiment. The cross-modal relationship network is used to obtain the cross-modal relation feature G_D,V of the depth image and the cross-modal relation feature G_V,D of the RGB image from the deep features G_D of the depth image and the deep features G_V of the RGB image.
In this embodiment, the similarity C_d,v uses the cosine similarity between the deep features G_D of the depth image and the deep features G_V of the RGB image:
C_d,v = (G_D · G_V) / (‖G_D‖_2 ‖G_V‖_2)
Unlike embodiment 1, the more similar two vectors are, the larger their cosine similarity. Therefore, the larger the similarity value C_d,v, the more likely it is that the input depth image and RGB image show the same pedestrian.
As shown in FIG. 3, the cross-modal relationship network comprises six convolutional layers Conv1 to Conv6. The deep features G_D after passing through the first convolutional layer Conv1 and the deep features G_V after passing through the second convolutional layer Conv2 are combined by a dot-product operation to obtain the cross-modal attention weight ω.
The deep features G_V of the RGB image pass through the fourth convolutional layer Conv4 and are multiplied by the cross-modal attention weight ω; the result then passes through the sixth convolutional layer Conv6 and is added to the deep features G_D to obtain the cross-modal relation feature G_D,V of the depth image.
The deep features G_D of the depth image pass through the third convolutional layer Conv3 and are multiplied by the cross-modal attention weight ω; the result then passes through the fifth convolutional layer Conv5 and is added to the deep features G_V to obtain the cross-modal relation feature G_V,D of the RGB image.
The fully connected layer is used to obtain the relation value S_d,v of the input depth image and RGB image from the cross-modal relation features G_D,V and G_V,D. In order to make the final relation value S_d,v lie between 0 and 1, a sigmoid function is appended after the fully connected layer, and its output is the relation value S_d,v of the input depth image and RGB image.
The similarity correction module corrects the similarity C_d,v with the relation value S_d,v to obtain the corrected similarity C′_d,v, and the recognition result is obtained according to C′_d,v.
In this embodiment, the similarity is corrected by simple addition, i.e. C′_d,v = C_d,v + S_d,v, indicated by the addition mark in FIG. 2.
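The relation network and the similarity correction can be sketched as follows. The exact tensor shapes, the use of 1x1 convolutions for Conv1-Conv6, the softmax normalization of the attention weight ω, and the global average pooling before the fully connected layer are assumptions made for this illustration; the patent text specifies the layer connectivity but not these details.

```python
# Illustrative sketch of the cross-modal relationship network (Conv1-Conv6) and the
# similarity correction C'_{d,v} = C_{d,v} + S_{d,v}. Shapes and normalization are assumed.
import torch
import torch.nn as nn

class CrossModalRelation(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Conv1..Conv6 modeled as 1x1 convolutions (assumption).
        self.conv = nn.ModuleList([nn.Conv2d(channels, channels, 1) for _ in range(6)])
        self.fc = nn.Linear(2 * channels, 1)  # fully connected layer -> relation value

    def forward(self, g_d: torch.Tensor, g_v: torch.Tensor):
        b, c, h, w = g_d.shape
        q = self.conv[0](g_d).flatten(2)                  # Conv1(G_D): (B, C, HW)
        k = self.conv[1](g_v).flatten(2)                  # Conv2(G_V): (B, C, HW)
        omega = torch.softmax(q.transpose(1, 2) @ k, -1)  # cross-modal attention weight, (B, HW, HW)

        v_rgb = self.conv[3](g_v).flatten(2)              # Conv4(G_V)
        g_dv = self.conv[5]((v_rgb @ omega.transpose(1, 2)).reshape(b, c, h, w)) + g_d  # Conv6, + G_D

        v_dep = self.conv[2](g_d).flatten(2)              # Conv3(G_D)
        g_vd = self.conv[4]((v_dep @ omega).reshape(b, c, h, w)) + g_v                  # Conv5, + G_V

        # Relation value S_{d,v} in (0, 1): pool both relation features, concatenate, FC + sigmoid.
        feats = torch.cat([g_dv.mean(dim=(2, 3)), g_vd.mean(dim=(2, 3))], dim=1)
        s_dv = torch.sigmoid(self.fc(feats)).squeeze(-1)
        return g_dv, g_vd, s_dv

def corrected_similarity(c_dv: torch.Tensor, s_dv: torch.Tensor) -> torch.Tensor:
    return c_dv + s_dv  # corrected similarity C'_{d,v} = C_{d,v} + S_{d,v}
```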
The contrast loss function calculation module calculates the contrast loss function L_C according to the relation value S_d,v.
During the training phase, the contrast loss function L_C uses training samples (d, v) consisting of a depth image d and an RGB image v, where it is known whether d and v belong to the same class, and the label I_d,v is defined as:
I_d,v = 1 − γ if d and v belong to the same class, and I_d,v = γ otherwise,
where γ is the label smoothing parameter, set to 0.1 in this embodiment. The contrast loss function is:
L_C = −I_d,v log(S_d,v) − (1 − I_d,v) log(1 − S_d,v)
This loss function reduces the difference between the two modalities by maximizing the relation values of cross-modal positive sample pairs. At the same time, it minimizes the relation values of cross-modal negative sample pairs, which improves the discriminability of the global features.
In this embodiment, the training loss function additionally includes the contrast loss L_C, where E( ) represents the average value, over epochs 1 to τ−1, of the Triplet loss function L_gt and the softmax loss functions L_gs1 and L_gs2.
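The exact total-loss formula appears as an image in the original patent and is not reproduced in the text, so the sketch below is only one interpretation consistent with the surrounding description: the contrast loss L_C uses a smoothed binary label, and its weight grows as the average global loss E(·) over previous epochs shrinks. The specific weighting function and the small epsilon are assumptions, not the patented formula.

```python
# Hedged sketch of the contrast loss L_C and an epoch-dependent combination with the
# other losses. Weighting L_C by 1 / E(previous global losses) is an interpretation only.
import torch

def contrast_loss(s_dv: torch.Tensor, same_id: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    """L_C with a smoothed binary label I_{d,v}; the (1 - gamma)/gamma split is an assumption."""
    i_dv = torch.full_like(s_dv, gamma)
    i_dv[same_id.bool()] = 1.0 - gamma
    return -(i_dv * torch.log(s_dv) + (1.0 - i_dv) * torch.log(1.0 - s_dv)).mean()

def total_loss(global_losses, local_losses, l_c, prev_global_avg, eps: float = 1e-6):
    # global_losses: [L_gs1, L_gs2, L_gt]; local_losses: [L_ls1, L_ls2, L_lt]
    # prev_global_avg: E(.), the average of the global losses over epochs 1..tau-1.
    weight_c = 1.0 / (prev_global_avg + eps)  # grows as the global losses shrink (assumption)
    return sum(global_losses) + sum(local_losses) + weight_c * l_c
```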
This embodiment uses the cross-modal relationship network to obtain the relationship between any cross-modal sample pair. To make full use of the relationships across modalities, this embodiment applies a cross-modal matching constraint to the cross-modal matching relationship, so that the difference between the two heterogeneous images is reduced in an end-to-end manner (a single training process), the recognition accuracy of the network is improved, and the training complexity is reduced at the same time.
Similar to embodiment 1, the recognition result is expressed in the following two ways:
Mode one:
If C′_d,v > R′_th, the input depth image and RGB image belong to the same class, i.e. the same pedestrian; otherwise the input depth image and RGB image are regarded as different classes. R′_th is the second recognition decision threshold, which can be obtained from statistics over multiple experiments.
Mode two:
The depth image of the pedestrian to be identified and the RGB images in the RGB pedestrian candidate library form a number of depth image/RGB image pairs. Each image pair is input into the cross-modal pedestrian re-identification network to obtain its similarity (in this embodiment, the cosine similarity between the deep features G_D of the depth image and the deep features G_V of the RGB image) and the corrected similarity C′_d,v; in the image pair with the largest C′_d,v value, the input depth image and RGB image are most likely of the same class.
Example 3:
This embodiment is a further improvement on embodiment 2. It differs from embodiment 2 in that the depth image local branch 210 further comprises a first adaptive graph network for extracting context-fused local features from the attention features FA_D of the depth image, and the RGB image local branch 220 further comprises a second adaptive graph network for extracting context-fused local features from the attention features FA_V of the RGB image. The structure of the local sub-network is shown in FIG. 4, and the cross-modal pedestrian re-identification network of this embodiment is shown in FIG. 5.
The first adaptive graph network has L_1 layers, each layer comprising N_1 nodes, where N_1 is the number of local attention feature blocks of the depth image acquired by the first adaptive pooling module. The edge between the i-th node P_i^l and the j-th node P_j^l in the l-th layer is given by the adjacency matrix A^l(i, j), which is computed from the 2-norm ‖·‖_2 of the node features, where i, j = 1, 2, …, N_1 and l = 1, 2, …, L_1.
The adjacency matrix A^l flexibly represents the relation between any two nodes. In order to embed the information of all graph nodes into the output nodes when updating the local spatial features, so that the output nodes become richer and more discriminative and recognition accuracy is improved, this embodiment adopts the following dynamic embedding:
The i-th node P_i^l of the l-th layer is obtained by fusing the representation of the i-th node of the previous layer with the adjacency-weighted contributions of the other previous-layer nodes, where P_i^0 is the i-th local attention feature block FA_D,i of the depth image, α is a balance parameter that determines the weight of the previous-layer representation of the i-th node relative to the influence of the other nodes during fusion (α is set to 0.1 in this embodiment), and F_l are the parameters of the l-th layer of the first adaptive graph network. Through this adaptive dynamic embedding, P_i^l takes into account the influence of all nodes P_j^{l-1} of the previous layer, so the node features of the output layer contain both the features of the local region and its relations with the other regions, making them richer.
The node P_i^{L_1} of the L_1-th layer is the i-th context-fused local feature FG_D,i of the depth image; the FG_D,i are concatenated to obtain the context-fused local feature FG_D of the depth image.
The second adaptive graph network has L_2 layers, each layer comprising N_2 nodes, where N_2 is the number of local attention feature blocks of the RGB image acquired by the second adaptive pooling module. The edge between the i-th node and the j-th node in the l-th layer is given by an adjacency matrix computed from the 2-norm ‖·‖_2 of the node features, where i, j = 1, 2, …, N_2 with i ≠ j, and l = 1, 2, …, L_2.
The i-th node of the l-th layer is computed in the same way as in the first adaptive graph network, where the 0-th layer node is the i-th local attention feature block FA_V,i of the RGB image, α′ is a balance parameter, and H_l are the parameters of the l-th layer of the second adaptive graph network.
The node of the L_2-th layer is the i-th context-fused local feature FG_V,i of the RGB image; the FG_V,i are concatenated to obtain the context-fused local feature FG_V of the RGB image.
In this embodiment, both L_1 and L_2 are 2.
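The adjacency and node-update formulas appear as images in the original and are not reproduced above, so the following is only one plausible reading: edges derived from the pairwise 2-norm distances between previous-layer node features, and each node updated as an α-balanced combination of its own previous representation and a linearly transformed, adjacency-weighted sum over all previous-layer nodes. The exp(−distance) adjacency and which term α scales are assumptions.

```python
# Hedged sketch of one adaptive-graph-network layer over the N local attention blocks.
import torch
import torch.nn as nn

class AdaptiveGraphLayer(nn.Module):
    def __init__(self, dim: int, alpha: float = 0.1):
        super().__init__()
        self.alpha = alpha
        self.f = nn.Linear(dim, dim)  # per-layer parameters F_l (H_l for the RGB branch)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (B, N, C); the 0-th layer nodes are the local attention feature blocks FA_i.
        dist = torch.cdist(nodes, nodes, p=2)   # pairwise 2-norms between node features
        adj = torch.softmax(-dist, dim=-1)      # adjacency matrix A(i, j) (assumed form)
        neighbor = adj @ self.f(nodes)          # adjacency-weighted influence of all nodes
        # alpha balances the node's own previous representation against the other nodes;
        # which term alpha scales is an assumption.
        return self.alpha * nodes + (1.0 - self.alpha) * neighbor  # nodes of the next layer

# With L_1 = L_2 = 2, two such layers are stacked and the output nodes FG_i are concatenated.
```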
Since the depth image contains only a single channel, prior-art methods that acquire the overall features of pedestrians with a simple global feature extraction network ignore the use of spatial context information. This embodiment uses the adaptive graph convolutional network to model the graph structure between different local spatial regions in the image of each modality, and improves recognition accuracy by acquiring rich spatial context information to strengthen the feature representation capability.
In the loss functions of embodiments 1-3, the logarithms uniformly use base e or base 2.
The recognition performance of the cross-modal pedestrian re-identification networks of embodiments 1-3 above was tested on the public dataset BIWI and compared with existing algorithms; the test results are shown in Table 1.
TABLE 1 Experimental results
Methods Recognition accuracy (%)
ICMDL 7.1
Corr.Dict. 11.3
Cross-modal distillation network 29.2
Example 1 43.7
Example 2 45.5
Example 3 47.1
In Table 1, ICMDL is the method of document 1, Corr.Dict. is the method of document 2, and the cross-modal distillation network is the method of document 3.
As can be seen from Table 1, the average recognition accuracy of the present invention is superior to the current state-of-the-art methods. Moreover, comparing the results of the embodiments without and with the cross-modal relationship network and the adaptive graph network shows that adding these two networks effectively improves the performance of the network.
The cross-modal pedestrian recognition device disclosed by the invention is shown in FIG. 6. It comprises a processor 601 and a storage medium 602, wherein the storage medium 602 is a computer-readable storage medium on which computer instructions are stored; when run, these instructions execute the steps of the cross-modal pedestrian re-identification method disclosed by the invention. The processor 601 loads and executes the instructions and data in the storage medium 602 to implement the above cross-modal pedestrian re-identification method.

Claims (10)

1. An RGB-D-based cross-modal pedestrian re-identification method, comprising a training phase and a recognition phase, characterized in that the training phase comprises the following steps:
S1, constructing a cross-modal pedestrian re-identification network, wherein the pedestrian re-identification network comprises a global sub-network (100) and a local sub-network (200); the global sub-network (100) comprises a depth image global branch (110), an RGB image global branch (120) and an identification branch (130); the local sub-network (200) comprises a depth image local branch (210), an RGB image local branch (220) and a local Triplet loss function calculation module;
the depth image global branch (110) is used for extracting deep features G_D of the depth image and comprises, connected in sequence: a first shallow feature extraction module for extracting shallow features FS_D of the input depth image; a first deep feature extraction module for extracting the deep features G_D of the depth image from FS_D; a first average pooling module for pooling the deep features G_D by average pooling to obtain the global feature FP_D of the depth image; a first batch normalization module for performing batch normalization on the global feature FP_D and obtaining the category feature FB_D of the depth image through a fully connected layer; and a first softmax function calculation module for calculating the softmax loss function L_gs1 of the category feature FB_D according to the real category labels of the training samples;
the RGB image global branch (120) is used for extracting deep features G_V of the RGB image and comprises, connected in sequence: a second shallow feature extraction module for extracting shallow features FS_V of the input RGB image; a second deep feature extraction module for extracting the deep features G_V of the RGB image from FS_V; a second average pooling module for pooling the deep features G_V by average pooling to obtain the global feature FP_V of the RGB image; a second batch normalization module for performing batch normalization on the global feature FP_V and obtaining the category feature FB_V of the RGB image through a fully connected layer; and a second softmax function calculation module for calculating the softmax loss function L_gs2 of the category feature FB_V according to the real category labels of the training samples;
the identification branch (130) is used for obtaining the similarity C_d,v of the depth image and the RGB image according to the deep features G_D of the depth image and the deep features G_V of the RGB image, obtaining the recognition result according to the similarity C_d,v, and computing the global Triplet loss function L_gt according to the global feature FP_D of the depth image and the global feature FP_V of the RGB image;
the depth image local branch (210) is used for extracting the context-fused local features FG_D of the deep features G_D of the depth image and comprises, connected in sequence: a first local feature extraction module for extracting local features FL_D of the deep features G_D of the depth image; a first adaptive pooling module for obtaining the attention features FA_D of the local features FL_D; and a first local loss function calculation module for calculating the local softmax loss function L_ls1 of the depth image;
the RGB image local branch (220) is used for extracting the context-fused local features FG_V of the deep features G_V of the RGB image and comprises, connected in sequence: a second local feature extraction module for extracting local features FL_V of the deep features G_V of the RGB image; a second adaptive pooling module for obtaining the attention features FA_V of the local features FL_V; and a second local loss function calculation module for calculating the local softmax loss function L_ls2 of the RGB image;
the local Triplet loss function calculation module is used for calculating the local Triplet loss function L_lt according to the local attention features FA_D of the depth image and the local attention features FA_V of the RGB image;
S2, training the network constructed in step S1 with a training set, wherein the samples in the training set are depth images labeled with categories and the corresponding RGB images; the training loss function fuses the global loss functions L_gs1, L_gs2 and L_gt with the local loss functions L_ls1, L_ls2 and L_lt, wherein the superscript τ denotes the number of epochs currently trained;
the recognition phase comprises:
S3, inputting the depth image and the RGB image in the cross-modal image pair to be identified into the depth image global branch (110) and the RGB image global branch (120) respectively, obtaining, via the identification branch (130), the similarity C_d,v of the deep features of the depth image and the RGB image, and obtaining the recognition result according to C_d,v.
2. The RGB-D-based cross-modal pedestrian re-identification method of claim 1, wherein the similarity C_d,v in the identification branch (130) is the Euclidean distance between the deep features G_D of the depth image and the deep features G_V of the RGB image, or the cosine similarity between the deep features G_D of the depth image and the deep features G_V of the RGB image.
3. The RGB-D-based cross-modal pedestrian re-identification method of claim 2, wherein the identification branch (130) in the global sub-network further comprises a cross-modal relationship network, a fully connected layer, a similarity correction module and a contrast loss function calculation module;
the cross-modal relationship network is used to obtain the cross-modal relation feature G_D,V of the depth image and the cross-modal relation feature G_V,D of the RGB image from the deep features G_D of the depth image and the deep features G_V of the RGB image;
the cross-modal relationship network comprises six convolutional layers Conv1 to Conv6; the deep features G_D after passing through the first convolutional layer Conv1 and the deep features G_V after passing through the second convolutional layer Conv2 are combined by a dot-product operation to obtain the cross-modal attention weight ω;
the deep features G_V pass through the fourth convolutional layer Conv4 and are multiplied by the cross-modal attention weight ω; the result then passes through the sixth convolutional layer Conv6 and is added to the deep features G_D to obtain the cross-modal relation feature G_D,V of the depth image;
the deep features G_D pass through the third convolutional layer Conv3 and are multiplied by the cross-modal attention weight ω; the result then passes through the fifth convolutional layer Conv5 and is added to the deep features G_V to obtain the cross-modal relation feature G_V,D of the RGB image;
the fully connected layer is used to obtain the relation value S_d,v of the input depth image and RGB image from the cross-modal relation features G_D,V and G_V,D;
the similarity correction module corrects the similarity C_d,v with the relation value S_d,v to obtain the corrected similarity C′_d,v, and the recognition result is obtained according to C′_d,v;
the contrast loss function calculation module calculates the contrast loss function L_C according to the relation value S_d,v;
the training loss function additionally includes the contrast loss L_C, wherein E( ) represents the average value, over epochs 1 to τ−1, of the Triplet loss function L_gt and the softmax loss functions L_gs1 and L_gs2.
4. The RGB-D-based cross-modal pedestrian re-identification method of claim 1, wherein the depth image local branch (210) further comprises a first adaptive graph network for extracting context-fused local features from the attention features FA_D of the depth image;
the first adaptive graph network has L_1 layers, each layer comprising N_1 nodes, N_1 being the number of local attention feature blocks of the depth image acquired by the first adaptive pooling module; the edge between the i-th node P_i^l and the j-th node P_j^l in the l-th layer is given by the adjacency matrix A^l(i, j), computed from the 2-norm ‖·‖_2 of the node features, where i, j = 1, 2, …, N_1 and l = 1, 2, …, L_1;
the i-th node P_i^l of the l-th layer is obtained by fusing the previous-layer representation of the i-th node with the adjacency-weighted contributions of the other previous-layer nodes, wherein P_i^0 is the i-th local attention feature block FA_D,i of the depth image, α is a balance parameter, and F_l are the parameters of the l-th layer of the first adaptive graph network;
the node P_i^{L_1} of the L_1-th layer is the i-th context-fused local feature FG_D,i of the depth image; the FG_D,i are concatenated to obtain the context-fused local feature FG_D of the depth image.
5. The RGB-D-based cross-modal pedestrian re-identification method of claim 1, wherein the RGB image local branch (220) further comprises a second adaptive graph network for extracting context-fused local features from the attention features FA_V of the RGB image;
the second adaptive graph network has L_2 layers, each layer comprising N_2 nodes, N_2 being the number of local attention feature blocks of the RGB image acquired by the second adaptive pooling module; the edge between the i-th node and the j-th node in the l-th layer is given by an adjacency matrix computed from the 2-norm ‖·‖_2 of the node features, where i, j = 1, 2, …, N_2 and l = 1, 2, …, L_2;
the i-th node of the l-th layer is obtained by fusing the previous-layer representation of the i-th node with the adjacency-weighted contributions of the other previous-layer nodes, wherein the 0-th layer node is the i-th local attention feature block FA_V,i of the RGB image, α′ is a balance parameter, and H_l are the parameters of the l-th layer of the second adaptive graph network;
the node of the L_2-th layer is the i-th context-fused local feature FG_V,i of the RGB image; the FG_V,i are concatenated to obtain the context-fused local feature FG_V of the RGB image.
6. The RGB-D-based cross-modal pedestrian re-identification method of claim 1, wherein the first shallow feature extraction module and the second shallow feature extraction module are both composed of the initial layer and block 1 of Resnet-50.
7. The RGB-D-based cross-modal pedestrian re-identification method of claim 1, wherein the first deep feature extraction module and the second deep feature extraction module share weights and are each composed of blocks 2-4 of the Resnet-50 network.
8. The RGB-D-based cross-modal pedestrian re-identification method of claim 1, wherein the first adaptive pooling module and the second adaptive pooling module obtain the local attention feature blocks by downsampling the input deep local feature blocks.
9. A computer-readable storage medium having stored thereon computer instructions which, when run, perform the cross-modal pedestrian re-identification method of any one of claims 1 to 8.
10. A cross-modal pedestrian recognition device, comprising a processor and a storage medium, the storage medium being the computer-readable storage medium of claim 9; the processor loads and executes the instructions and data in the storage medium to implement the cross-modal pedestrian re-identification method of any one of claims 1 to 8.
CN202111148969.XA 2021-09-29 2021-09-29 RGB-D-based cross-modal pedestrian re-identification method, storage medium and device Active CN113887382B (en)


Publications (2)

Publication Number Publication Date
CN113887382A (en) 2022-01-04
CN113887382B (en) 2024-02-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant