CN113887382B - RGB-D-based cross-modal pedestrian re-identification method, storage medium and device

Info

Publication number: CN113887382B
Application number: CN202111148969.XA
Authority: CN (China)
Prior art keywords: local, RGB, feature, image, depth image
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN113887382A
Inventors: 吴晶晶, 蒋建国, 齐美彬, 尤小泉, 庄硕, 李小红
Current Assignee: Hefei University of Technology
Original Assignee: Hefei University of Technology
Application filed by Hefei University of Technology; priority to CN202111148969.XA
Publication of CN113887382A; application granted; publication of CN113887382B

Classifications

  • G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches (G PHYSICS; G06F ELECTRIC DIGITAL DATA PROCESSING; G06F18/00 Pattern recognition; G06F18/20 Analysing; G06F18/24 Classification techniques)
  • Y02T10/40 — Engine management systems (Y02T Climate change mitigation technologies related to transportation; Y02T10/00 Road transport of goods or passengers; Y02T10/10 Internal combustion engine [ICE] based vehicles)


Abstract

The invention discloses an RGB-D-based cross-modal pedestrian re-identification method. Global features of the depth image and the RGB image are extracted by the global branches of the depth image and the RGB image respectively, and local features of the depth image and the RGB image are extracted by their local branches. The cross-modal pedestrian re-identification network is trained by fusing the global and local loss functions, and the trained network is then used for pedestrian re-identification. By fully exploiting the relationship between the depth-image and RGB-image modalities, the method improves recognition accuracy.

Description

RGB-D-based cross-modal pedestrian re-identification method, storage medium and device
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a cross-modal pedestrian re-identification method based on a depth image and an RGB image.
Background
Given a depth image (D) of a pedestrian of interest, RGB-D cross-modal pedestrian re-identification aims to retrieve the target pedestrian from an RGB pedestrian candidate library. Each pixel value in the depth map describes the depth (distance) of the corresponding scene point. Images of this modality are not susceptible to illumination changes, so a depth image can be captured in place of an RGB image when lighting is too poor for RGB imaging. Compared with traditional pedestrian re-identification based on RGB images, RGB-D cross-modal pedestrian re-identification can therefore identify pedestrians under poor lighting conditions such as at night. The task thus better matches practical requirements and can be applied to a wider range of real scenes.
RGB-D cross-modal pedestrian re-identification requires matching pedestrians between pedestrian images of different modalities. In addition to the challenges of RGB pedestrian re-identification, such as changes in viewing angle, cluttered backgrounds and occlusion of pedestrians, the large gap between modalities makes cross-modal recognition even more difficult. Document 1 (Peng Zhang, Jinsong Xu, Qiang Wu, Yan Huang, and Jian Zhang. 2019. Top-push constrained modality-adaptive dictionary learning for cross-modality person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 12 (2019), 4554-4566) and document 2 (Jiaxuan Zhuo, Junyong Zhu, Jianhuang Lai, and Xiaohua Xie. 2017. Person re-identification on heterogeneous camera network. In CCF Chinese Conference on Computer Vision. Springer, 280-291) recognize pedestrians by extracting manually designed features and lack abstract semantic expression. Document 3 (Frank M. Hafner, Amran Bhuiyan, Julian F. P. Kooij, and Eric Granger. 2019. RGB-depth cross-modal person re-identification. In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 1-8) first trains a single-modality pedestrian re-identification network on depth maps. It then trains an RGB image recognition network while keeping the parameters of the depth-map network fixed, and uses distillation learning to constrain the features of each RGB image to be as similar as possible to the features of its corresponding depth map, thereby reducing the gap between modalities. Although this method constrains the features of one-to-one corresponding cross-modal image pairs to be as similar as possible through a two-stage learning process, which reduces the difference between the two modalities and improves cross-modal recognition accuracy, it ignores the rich relationships contained in all other cross-modal image pairs. Moreover, the two rounds of training make the training process of the whole network more complicated. Furthermore, it uses only a simple global feature extraction network to obtain pedestrian features, ignoring the use of spatial context information. Spatial relationships contain abundant appearance details and pedestrian body-shape information, which are important for improving the representation capability of the image, especially for the depth image, which provides only single-channel information.
Disclosure of Invention
The invention aims to: the invention provides a cross-modal pedestrian re-identification method based on a depth image and an RGB image, with the goal of capturing the relationship between the two modalities of the depth image and the RGB image and improving recognition accuracy.
The technical scheme is as follows: the invention provides an RGB-D-based cross-modal pedestrian re-identification method, which comprises a training phase and a recognition phase, wherein the training phase comprises the following steps:
s1, constructing a cross-mode pedestrian re-identification network, wherein the pedestrian re-identification network comprises a global sub-network 100 and a local sub-network 200, and the global sub-network 100 comprises a depth image global branch 110, an RGB image global branch 120 and an identification branch 130; the local subnetwork 200 comprises a depth image local branch 210, an RGB image local branch 220 and a local Triplet loss function calculation module;
the depth image global branch 110 is used for extracting deep features G of the depth image D The method comprises the steps of carrying out a first treatment on the surface of the Comprises a first shallow feature extraction module which is connected in sequence and is used for extracting shallow features FS of an input depth image D The method comprises the steps of carrying out a first treatment on the surface of the A first deep feature extraction module for extracting the first deep feature from FS D Deep feature G of extracted depth image D The method comprises the steps of carrying out a first treatment on the surface of the A first average pooling module for using the average pooling method to pool the deep features G D Global feature FP of a depth image obtained from a camera D The method comprises the steps of carrying out a first treatment on the surface of the A first batch of normalization modules for global features FP D Batch standardization processing is carried out, and category characteristics FB of depth images are obtained through the full connection layer D The method comprises the steps of carrying out a first treatment on the surface of the A first softmax function calculation module for calculating category characteristics FB according to the real category labels of the training samples D Is a softmax loss function L of (2) gs1
The global RGB image branch 120 is used for extracting deep features G of RGB images V The method comprises the steps of carrying out a first treatment on the surface of the Comprises a second shallow feature extraction module which is connected in sequence and is used for extracting shallow features FS of an input RGB image V The method comprises the steps of carrying out a first treatment on the surface of the A second deep feature extraction module for extracting the feature from FS V Deep feature G of RGB image V The method comprises the steps of carrying out a first treatment on the surface of the A second average pooling module for using the average pooling method to pool the deep features G V Global feature FP of an RGB image V The method comprises the steps of carrying out a first treatment on the surface of the A second batch of normalization modules for global features FP V Batch standardization processing is carried out, and category characteristics FB of RGB images are obtained through the full connection layer V The method comprises the steps of carrying out a first treatment on the surface of the A second softmax function calculation module for calculating the category feature FB according to the real category label of the training sample V Is a softmax loss function L of (2) gs2
The identification branch 130 is configured to identify deep features G according to the depth image D And deep features G of RGB images V Obtaining the similarity C of the depth image and the RGB image d,v According to the similarity C d,v Obtaining a recognition result according to the global characteristic FP of the depth image D And global feature FP of RGB images V Computing global Triplet loss function L gt
The local depth image branch 210 is used for extracting deep features G of the depth image D Contextual local feature FG of (C) D The method comprises the steps of carrying out a first treatment on the surface of the Comprises a first local feature extraction module which is connected in sequence and is used for extracting deep features G of a depth image D Local feature FL of (2) D The method comprises the steps of carrying out a first treatment on the surface of the A first adaptive pooling module for obtaining local feature FL D Attention feature FA of (a) D The method comprises the steps of carrying out a first treatment on the surface of the A first local loss function calculation module for calculating a local softmax loss function L of the depth image ls1
The RGB image local branch 220 is used for extracting deep features G of the RGB image V Context local feature FB of (1) V The method comprises the steps of carrying out a first treatment on the surface of the Comprises a second local feature extraction module which is connected in sequence and is used for extracting deep features G of RGB images V Local feature FL of (2) V The method comprises the steps of carrying out a first treatment on the surface of the A second adaptive pooling module for obtaining the local feature FL V Attention feature FA of (a) V The method comprises the steps of carrying out a first treatment on the surface of the A second local loss function calculation module for calculating a local softmax loss function L of the RGB image ls2
The local Triplet loss function calculation module is used for calculating local attention characteristic FA according to the depth image D And a local attention feature FA of an RGB image V Calculating a local Triplet loss function L lt
S2, training the network constructed in step S1 with a training set, wherein the samples in the training set are depth images labeled with categories and the corresponding RGB images. The training loss function fuses the global loss functions L_gs1, L_gs2 and L_gt with the local loss functions L_ls1, L_ls2 and L_lt, where the superscript τ denotes the number of epochs currently trained.
The identification phase comprises:
s3, respectively inputting the depth image and the RGB image in the cross-modal image pair to be identified into the depth image global branch 110 and the RGB image global branch 120, and identifying the branch 130 to obtain the similarity C of the depth features of the depth image and the RGB image d,v And according to the similarity C d,v And obtaining a recognition result.
In another aspect, the invention discloses a computer-readable storage medium having stored thereon computer instructions which, when executed, perform the above cross-modal pedestrian re-identification method.
In yet another aspect, the invention discloses a cross-modal pedestrian recognition device comprising a processor and a storage medium, the storage medium being the computer-readable storage medium described above; the processor loads and executes the instructions and data in the storage medium to implement the above cross-modal pedestrian re-identification method.
The beneficial effects are as follows: compared with the prior art, the cross-modal-constraint-based RGB-D cross-modal pedestrian re-identification method disclosed by the invention obtains the recognition result through the relationship between the deep features of the depth image and the RGB image, and trains the whole recognition network by fusing the global and local error functions of the two modalities, thereby fully exploiting the relationships across modalities and improving recognition accuracy.
Drawings
FIG. 1 is a structural diagram of the cross-modal pedestrian re-identification network in embodiment 1;
FIG. 2 is a structural diagram of the global sub-network in embodiment 2;
FIG. 3 is a structural diagram of the cross-modal relationship network in embodiment 2;
FIG. 4 is a structural diagram of the local sub-network in embodiment 3;
FIG. 5 is a structural diagram of the cross-modal pedestrian re-identification network in embodiment 3;
FIG. 6 is a schematic diagram of the cross-modal pedestrian recognition device of the present disclosure.
Detailed Description
The invention is further elucidated below in connection with the drawings and the detailed description.
Example 1:
This embodiment discloses an RGB-D cross-modal pedestrian re-identification method based on cross-modal constraints, which comprises a training phase and a recognition phase, wherein the training phase comprises the following steps:
s1, constructing a cross-mode pedestrian re-recognition network, wherein the pedestrian re-recognition network comprises a global sub-network 100 and a local sub-network 200, and the global sub-network 100 comprises a depth image global branch 110, an RGB image global branch 120 and a recognition branch 130; the local subnetwork 200 includes a depth image local branch 210, an RGB image local branch 220, a local Triplet loss function calculation module.
The depth image global branch 110 is used for extracting deep features G_D of the depth image. It comprises, connected in sequence: a first shallow feature extraction module for extracting shallow features FS_D of the input depth image; a first deep feature extraction module for extracting the deep features G_D of the depth image from FS_D; a first average pooling module for pooling the deep features G_D by average pooling to obtain the global feature FP_D of the depth image; a first batch normalization module for performing batch normalization (Batch Normalization) on the global feature FP_D and obtaining the category feature FB_D of the depth image through a fully connected layer; and a first softmax function calculation module for calculating the softmax loss function L_gs1 of the category feature FB_D according to the real category labels of the training samples.
The RGB image global branch 120 is used for extracting deep features G_V of the RGB image. It comprises, connected in sequence: a second shallow feature extraction module for extracting shallow features FS_V of the input RGB image; a second deep feature extraction module for extracting the deep features G_V of the RGB image from FS_V; a second average pooling module for pooling the deep features G_V by average pooling to obtain the global feature FP_V of the RGB image; a second batch normalization module for performing batch normalization (Batch Normalization) on the global feature FP_V and obtaining the category feature FB_V of the RGB image through a fully connected layer; and a second softmax function calculation module for calculating the softmax loss function L_gs2 of the category feature FB_V according to the real category labels of the training samples.
In this embodiment, the first shallow feature extraction module and the second shallow feature extraction module are both composed of the initial layer and block 1 of Resnet-50. For details of Resnet-50, see: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778. Shallow feature extraction is not shared between the depth image and RGB image modalities; its aim is to extract modality-specific features from the images of the two modalities respectively.
The first deep feature extraction module and the second deep feature extraction module share weights; their aim is to extract deep features common to the two modalities, thereby reducing the difference between them. In this embodiment, each is composed of blocks 2-4 of the Resnet-50 network.
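To make this backbone layout concrete, the following PyTorch-style sketch shows one way such a two-stream feature extractor could be organized: modality-specific shallow layers (the ResNet-50 stem plus block 1) and weight-shared deep layers (blocks 2-4). It is only an illustrative sketch under the assumptions noted in the comments (for example, the class name DualStreamBackbone and the replication of the single-channel depth map to three channels), not the patented implementation.

```python
# Illustrative sketch: two-stream ResNet-50 backbone with modality-specific shallow
# layers and a shared deep stream. Names such as DualStreamBackbone are invented here.
import torch.nn as nn
from torchvision.models import resnet50

def split_resnet50():
    """Split ResNet-50 into a shallow part (stem + block 1) and a deep part (blocks 2-4)."""
    net = resnet50(weights=None)
    shallow = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool, net.layer1)
    deep = nn.Sequential(net.layer2, net.layer3, net.layer4)
    return shallow, deep

class DualStreamBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Shallow feature extraction is NOT shared: one stream per modality.
        self.shallow_depth, self.deep_shared = split_resnet50()  # deep blocks are shared
        self.shallow_rgb, _ = split_resnet50()

    def forward(self, x_depth, x_rgb):
        # x_depth is assumed to be the single-channel depth map replicated to 3 channels.
        fs_d = self.shallow_depth(x_depth)   # FS_D
        fs_v = self.shallow_rgb(x_rgb)       # FS_V
        g_d = self.deep_shared(fs_d)         # G_D, shape (B, 2048, h, w)
        g_v = self.deep_shared(fs_v)         # G_V
        return g_d, g_v
```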
The global feature FP_D obtained by the first average pooling module is processed by the first batch normalization module and the fully connected layer to obtain the category feature FB_D, with FB_D ∈ R^u, where u is the number of pedestrian categories.
FB_D is input into the first softmax loss function calculation module to calculate the softmax loss function L_gs1. The specific steps are as follows:
First, the probability that the training sample belongs to each class k, k ∈ {1, 2, …, u}, is calculated:
p(k) = exp(FB_D,k) / Σ_{j=1}^{u} exp(FB_D,j)
where FB_D,k is the value of FB_D on the k-th channel. The loss function L_gs1 is then calculated as:
L_gs1 = −Σ_{k=1}^{u} q(k) log p(k)
where, if g is the true class of the training sample, q(k) = 1 when k = g and q(k) = 0 otherwise. Minimizing L_gs1 maximizes the predicted probability of the correct category.
Likewise, the global feature FP_V obtained by the second average pooling module is processed by the second batch normalization module and the fully connected layer to obtain the category feature FB_V, with FB_V ∈ R^u. FB_V is input into the second softmax loss function calculation module to calculate the softmax loss function L_gs2.
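A minimal sketch of one such global classification head (average pooling, batch normalization, fully connected layer, softmax loss) might look as follows; the same structure is used for both modalities, and the class name GlobalHead is invented for this example.

```python
# Illustrative sketch of a global classification head: average pooling, batch
# normalization, a fully connected layer, and the softmax (cross-entropy) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalHead(nn.Module):
    def __init__(self, in_dim: int, num_ids: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # average pooling -> global feature FP
        self.bn = nn.BatchNorm1d(in_dim)      # batch normalization module
        self.fc = nn.Linear(in_dim, num_ids)  # fully connected layer -> category feature FB in R^u

    def forward(self, deep_feat: torch.Tensor):
        fp = self.pool(deep_feat).flatten(1)  # FP: (B, C)
        fb = self.fc(self.bn(fp))             # FB: (B, u)
        return fp, fb

def softmax_loss(fb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # L_gs = -sum_k q(k) log p(k), with q the one-hot ground-truth distribution.
    return F.cross_entropy(fb, labels)
```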
The identification branch 130 is used to obtain the similarity C_d,v of the depth image and the RGB image according to the deep features G_D of the depth image and the deep features G_V of the RGB image, to obtain the recognition result according to the similarity C_d,v, and to compute the global Triplet loss function L_gt according to the global feature FP_D of the depth image and the global feature FP_V of the RGB image.
In this embodiment, the similarity C_d,v is calculated by a similarity calculation module; specifically, the Euclidean distance between the deep features G_D of the depth image and the deep features G_V of the RGB image is used:
C_d,v = sqrt( Σ_{n=1}^{N_D} (G_D,n − G_V,n)² )
where G_D and G_V both have dimension N_D, and G_D,n and G_V,n are the n-th elements of G_D and G_V respectively.
The more similar two vectors are, the smaller their Euclidean distance. Therefore, the smaller the similarity value C_d,v, the more likely it is that the input depth image and RGB image show the same pedestrian. The recognition result can be expressed in either of the following two ways:
mode one:
if C d,v <R th The input depth image and the RGB image are of the same kind, namely the same pedestrian; the other cases consider the input depth image and RGB image as different classes. Wherein R is th The first recognition judgment threshold value can be obtained through multiple experimental statistics.
Mode two:
the depth image of the pedestrian to be identified and the RGB image in the RGB pedestrian candidate library form a plurality of depth images and RGB image pairs, each image pair is input into a cross-mode pedestrian re-identification network respectively, and the similarity (depth characteristic G of the depth image in the embodiment) of each image pair is obtained D And deep features G of RGB images V Euclidean distance) of the image pair with the smallest similarity value, the input depth image and the RGB image are the most likely to be of the same kind. In this embodiment, the identification branch 130 uses a global Triplet loss function calculation module to calculate the global feature FP according to the depth image D And global feature FP of RGB images V Calculating a Triplet loss function L gt
The global feature FP_D obtained by the first average pooling module and the global feature FP_V obtained by the second average pooling module serve as the inputs of the global Triplet loss function calculation module. Following Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, the Triplet loss function L_gt is calculated as: L_gt = max(d_p − d_n + β, 0),
where d_p and d_n are the distances between the features of the positive pair and of the negative pair respectively, and β is the margin of the Triplet loss function; β = 0.3 in this embodiment.
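Both the Euclidean similarity and the global Triplet loss admit a direct implementation. The sketch below assumes the features have already been flattened into vectors and uses the margin β = 0.3 from this embodiment; the simple anchor/positive/negative interface is an assumption, since the patent does not specify how triplets are mined.

```python
# Illustrative sketch: Euclidean similarity between depth and RGB features, gallery
# ranking (mode two), and the global Triplet loss L_gt = max(d_p - d_n + beta, 0).
import torch
import torch.nn.functional as F

def euclidean_similarity(g_d: torch.Tensor, g_v: torch.Tensor) -> torch.Tensor:
    """C_{d,v}: a smaller value means the two images more likely show the same pedestrian."""
    return torch.norm(g_d - g_v, p=2, dim=-1)

def rank_gallery(query_depth_feat: torch.Tensor, gallery_rgb_feats: torch.Tensor) -> torch.Tensor:
    """Mode two: rank RGB candidates by ascending Euclidean distance to the depth query."""
    dists = torch.norm(gallery_rgb_feats - query_depth_feat, p=2, dim=1)
    return torch.argsort(dists)  # index of the best match comes first

def global_triplet_loss(fp_anchor, fp_pos, fp_neg, beta: float = 0.3) -> torch.Tensor:
    """L_gt averaged over the batch."""
    d_p = torch.norm(fp_anchor - fp_pos, p=2, dim=1)  # positive-pair distance
    d_n = torch.norm(fp_anchor - fp_neg, p=2, dim=1)  # negative-pair distance
    return F.relu(d_p - d_n + beta).mean()
```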
The depth image local branch 210 is used for extracting the context-fused local features FG_D of the deep features G_D of the depth image. It comprises, connected in sequence: a first local feature extraction module for extracting local features FL_D of the deep features G_D of the depth image; a first adaptive pooling module for obtaining the attention features FA_D of the local features FL_D; and a first local loss function calculation module for calculating the local softmax loss function L_ls1 of the depth image.
The RGB image local branch 220 is used for extracting the context-fused local features FG_V of the deep features G_V of the RGB image. It comprises, connected in sequence: a second local feature extraction module for extracting local features FL_V of the deep features G_V of the RGB image; a second adaptive pooling module for obtaining the attention features FA_V of the local features FL_V; and a second local loss function calculation module for calculating the local softmax loss function L_ls2 of the RGB image.
The local Triplet loss function calculation module is used for calculating the local Triplet loss function L_lt according to the local attention features FA_D of the depth image and the local attention features FA_V of the RGB image.
In this embodiment, the first local feature extraction module and the second local feature extraction module each consist of three convolutional layers with kernel size 3 and stride 1 connected in sequence, which further encode the deep features G_D and G_V to obtain FL_D and FL_V respectively.
The processing of FL_D by the first adaptive pooling module is specifically as follows:
FL_D is first split evenly into N_1 blocks to obtain N_1 local features FL_D,i, i = 1, 2, …, N_1; in this embodiment N_1 = 4. To obtain more discriminative local features, this embodiment uses adaptive pooling (attentive pooling) to downsample FL_D,i. Specifically, a spatial distribution is adaptively learned for each local feature FL_D,i; after the distribution is normalized with the softmax function, it is used as the downsampling weight to downsample the local feature FL_D,i. In this way, discriminative spatial parts are automatically selected from the local features to characterize the local region, so that the share of discriminative regions in the local space is reinforced. The pooling outputs a corresponding local attention feature block FA_D,i for each FL_D,i, yielding N_1 local attention feature blocks in total.
The second adaptive pooling module performs the same processing on FL_V to obtain N_2 local attention feature blocks. In this embodiment, both N_1 and N_2 are 4.
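One possible reading of this attentive pooling step is sketched below: the local feature map is split into N blocks along the height, a spatial attention map is predicted per block, normalized with softmax, and used as the downsampling weight. The 1x1-convolution attention predictor is an assumption; the patent only states that a spatial distribution is adaptively learned.

```python
# Illustrative sketch of attentive pooling: split FL into N local blocks, learn a
# spatial weight per block, normalize it with softmax, and pool with those weights.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, channels: int, num_blocks: int = 4):
        super().__init__()
        self.num_blocks = num_blocks
        # A shared 1x1 convolution predicts the spatial distribution (assumption).
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, fl: torch.Tensor):
        # fl: local feature map FL of shape (B, C, H, W); split evenly along the height.
        blocks = torch.chunk(fl, self.num_blocks, dim=2)
        pooled = []
        for blk in blocks:                                      # blk corresponds to FL_i
            w = self.attn(blk)                                  # spatial distribution, (B, 1, h, W)
            a = torch.softmax(w.flatten(2), dim=-1).view_as(w)  # normalize with softmax
            fa = (blk * a).sum(dim=(2, 3))                      # weighted downsampling -> FA_i, (B, C)
            pooled.append(fa)
        return pooled  # N local attention feature blocks
```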
The N_1 local attention feature blocks obtained by the first adaptive pooling module are concatenated and input into the first local loss function calculation module to calculate the local Triplet loss function L_lt1 and the local softmax loss function L_ls1 of the depth image. Likewise, the N_2 local attention feature blocks obtained by the second adaptive pooling module are concatenated and input into the second local loss function calculation module to calculate the local Triplet loss function L_lt2 and the local softmax loss function L_ls2 of the RGB image.
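A short sketch of such a local loss head is given below: the local attention feature blocks are concatenated into one local descriptor, which is classified for the local softmax loss and also returned for the local Triplet loss. The BatchNorm-plus-Linear classifier layout is an assumption made for this example.

```python
# Illustrative sketch of the local loss computation over concatenated attention blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalLossHead(nn.Module):
    def __init__(self, dim_per_block: int, num_blocks: int, num_ids: int):
        super().__init__()
        feat_dim = dim_per_block * num_blocks
        self.bn = nn.BatchNorm1d(feat_dim)      # classifier head layout is an assumption
        self.fc = nn.Linear(feat_dim, num_ids)

    def forward(self, blocks, labels):
        # blocks: list of N tensors FA_i, each of shape (B, C)
        feat = torch.cat(blocks, dim=1)          # concatenated local descriptor
        l_ls = F.cross_entropy(self.fc(self.bn(feat)), labels)  # local softmax loss
        return feat, l_ls                        # `feat` also feeds the local Triplet loss
```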
S2, training the network constructed in step S1 with a training set, wherein the samples in the training set are depth images labeled with categories and the corresponding RGB images. The training loss function fuses the global loss functions L_gs1, L_gs2 and L_gt with the local loss functions L_ls1, L_ls2 and L_lt, where the superscript τ denotes the number of epochs currently trained. The weight of the contrast loss L_C is inversely proportional to the global loss function values; this allows the cross-modal relational modeling of the global features to gradually increase its proportion in the overall training once the global features have become more discriminative.
The trained cross-modal pedestrian re-identification network is then used to recognize the cross-modal image pair to be identified:
The depth image and the RGB image in the cross-modal image pair to be identified are input into the depth image global branch 110 and the RGB image global branch 120 respectively, and the identification branch 130 obtains the similarity C_d,v of the depth image and the RGB image and obtains the recognition result according to C_d,v.
Example 2:
The difference between this embodiment and embodiment 1 is that the identification branch 130 in the global sub-network further comprises a cross-modal relationship network, a fully connected layer, a similarity correction module and a contrast loss function calculation module; FIG. 2 is a schematic diagram of the global sub-network 100 in this embodiment. The cross-modal relationship network is used to obtain the cross-modal relation feature G_D,V of the depth image and the cross-modal relation feature G_V,D of the RGB image from the deep features G_D of the depth image and the deep features G_V of the RGB image.
In this embodiment, the similarity C_d,v uses the cosine similarity between the deep features G_D of the depth image and the deep features G_V of the RGB image:
C_d,v = (G_D · G_V) / (‖G_D‖_2 ‖G_V‖_2)
Unlike embodiment 1, the more similar two vectors are, the larger their cosine similarity. Therefore, the larger the similarity value C_d,v, the more likely it is that the input depth image and RGB image show the same pedestrian.
As shown in FIG. 3, the cross-modal relationship network comprises six convolutional layers Conv1 to Conv6. The deep features G_D after passing through the first convolutional layer Conv1 and the deep features G_V after passing through the second convolutional layer Conv2 are combined by a dot-product operation to obtain the cross-modal attention weight ω.
The deep features G_V of the RGB image pass through the fourth convolutional layer Conv4 and are multiplied by the cross-modal attention weight ω; the result then passes through the sixth convolutional layer Conv6 and is added to the deep features G_D to obtain the cross-modal relation feature G_D,V of the depth image.
The deep features G_D of the depth image pass through the third convolutional layer Conv3 and are multiplied by the cross-modal attention weight ω; the result then passes through the fifth convolutional layer Conv5 and is added to the deep features G_V to obtain the cross-modal relation feature G_V,D of the RGB image.
The fully connected layer is used to obtain the relation value S_d,v of the input depth image and RGB image from the cross-modal relation features G_D,V and G_V,D. In order to make the final relation value S_d,v lie between 0 and 1, a sigmoid function is appended after the fully connected layer, and its output is the relation value S_d,v of the input depth image and RGB image.
The similarity correction module corrects the similarity C_d,v with the relation value S_d,v to obtain the corrected similarity C′_d,v, and the recognition result is obtained according to C′_d,v.
In this embodiment, the similarity is corrected by simple addition, i.e. C′_d,v = C_d,v + S_d,v, indicated by the addition mark in FIG. 2.
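The relation network and the similarity correction can be sketched as follows. The exact tensor shapes, the use of 1x1 convolutions for Conv1-Conv6, the softmax normalization of the attention weight ω, and the global average pooling before the fully connected layer are assumptions made for this illustration; the patent text specifies the layer connectivity but not these details.

```python
# Illustrative sketch of the cross-modal relationship network (Conv1-Conv6) and the
# similarity correction C'_{d,v} = C_{d,v} + S_{d,v}. Shapes and normalization are assumed.
import torch
import torch.nn as nn

class CrossModalRelation(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Conv1..Conv6 modeled as 1x1 convolutions (assumption).
        self.conv = nn.ModuleList([nn.Conv2d(channels, channels, 1) for _ in range(6)])
        self.fc = nn.Linear(2 * channels, 1)  # fully connected layer -> relation value

    def forward(self, g_d: torch.Tensor, g_v: torch.Tensor):
        b, c, h, w = g_d.shape
        q = self.conv[0](g_d).flatten(2)                  # Conv1(G_D): (B, C, HW)
        k = self.conv[1](g_v).flatten(2)                  # Conv2(G_V): (B, C, HW)
        omega = torch.softmax(q.transpose(1, 2) @ k, -1)  # cross-modal attention weight, (B, HW, HW)

        v_rgb = self.conv[3](g_v).flatten(2)              # Conv4(G_V)
        g_dv = self.conv[5]((v_rgb @ omega.transpose(1, 2)).reshape(b, c, h, w)) + g_d  # Conv6, + G_D

        v_dep = self.conv[2](g_d).flatten(2)              # Conv3(G_D)
        g_vd = self.conv[4]((v_dep @ omega).reshape(b, c, h, w)) + g_v                  # Conv5, + G_V

        # Relation value S_{d,v} in (0, 1): pool both relation features, concatenate, FC + sigmoid.
        feats = torch.cat([g_dv.mean(dim=(2, 3)), g_vd.mean(dim=(2, 3))], dim=1)
        s_dv = torch.sigmoid(self.fc(feats)).squeeze(-1)
        return g_dv, g_vd, s_dv

def corrected_similarity(c_dv: torch.Tensor, s_dv: torch.Tensor) -> torch.Tensor:
    return c_dv + s_dv  # corrected similarity C'_{d,v} = C_{d,v} + S_{d,v}
```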
The contrast loss function calculation module calculates the contrast loss function L_C according to the relation value S_d,v.
During the training phase, the contrast loss function L_C uses training samples (d, v) consisting of a depth image d and an RGB image v, where it is known whether d and v belong to the same class, and the label I_d,v is defined as:
I_d,v = 1 − γ if d and v belong to the same class, and I_d,v = γ otherwise,
where γ is the label smoothing parameter, set to 0.1 in this embodiment. The contrast loss function is:
L_C = −I_d,v log(S_d,v) − (1 − I_d,v) log(1 − S_d,v)
This loss function reduces the difference between the two modalities by maximizing the relation values of cross-modal positive sample pairs. At the same time, it minimizes the relation values of cross-modal negative sample pairs, which improves the discriminability of the global features.
In this embodiment, the training loss function additionally includes the contrast loss L_C, where E( ) represents the average value, over epochs 1 to τ−1, of the Triplet loss function L_gt and the softmax loss functions L_gs1 and L_gs2.
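The exact total-loss formula appears as an image in the original patent and is not reproduced in the text, so the sketch below is only one interpretation consistent with the surrounding description: the contrast loss L_C uses a smoothed binary label, and its weight grows as the average global loss E(·) over previous epochs shrinks. The specific weighting function and the small epsilon are assumptions, not the patented formula.

```python
# Hedged sketch of the contrast loss L_C and an epoch-dependent combination with the
# other losses. Weighting L_C by 1 / E(previous global losses) is an interpretation only.
import torch

def contrast_loss(s_dv: torch.Tensor, same_id: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    """L_C with a smoothed binary label I_{d,v}; the (1 - gamma)/gamma split is an assumption."""
    i_dv = torch.full_like(s_dv, gamma)
    i_dv[same_id.bool()] = 1.0 - gamma
    return -(i_dv * torch.log(s_dv) + (1.0 - i_dv) * torch.log(1.0 - s_dv)).mean()

def total_loss(global_losses, local_losses, l_c, prev_global_avg, eps: float = 1e-6):
    # global_losses: [L_gs1, L_gs2, L_gt]; local_losses: [L_ls1, L_ls2, L_lt]
    # prev_global_avg: E(.), the average of the global losses over epochs 1..tau-1.
    weight_c = 1.0 / (prev_global_avg + eps)  # grows as the global losses shrink (assumption)
    return sum(global_losses) + sum(local_losses) + weight_c * l_c
```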
This embodiment uses the cross-modal relationship network to obtain the relationship between any cross-modal sample pair. To make full use of the relationships across modalities, this embodiment applies a cross-modal matching constraint to the cross-modal matching relationship, so that the difference between the two heterogeneous images is reduced in an end-to-end manner (a single training process), the recognition accuracy of the network is improved, and the training complexity is reduced at the same time.
Similar to embodiment 1, the recognition result is expressed in the following two ways:
Mode one:
If C′_d,v > R′_th, the input depth image and RGB image belong to the same class, i.e. the same pedestrian; otherwise the input depth image and RGB image are regarded as different classes. R′_th is the second recognition decision threshold, which can be obtained from statistics over multiple experiments.
Mode two:
The depth image of the pedestrian to be identified and the RGB images in the RGB pedestrian candidate library form a number of depth image/RGB image pairs. Each image pair is input into the cross-modal pedestrian re-identification network to obtain its similarity (in this embodiment, the cosine similarity between the deep features G_D of the depth image and the deep features G_V of the RGB image) and the corrected similarity C′_d,v; in the image pair with the largest C′_d,v value, the input depth image and RGB image are most likely of the same class.
Example 3:
This embodiment is a further improvement on embodiment 2. It differs from embodiment 2 in that the depth image local branch 210 further comprises a first adaptive graph network for extracting context-fused local features from the attention features FA_D of the depth image, and the RGB image local branch 220 further comprises a second adaptive graph network for extracting context-fused local features from the attention features FA_V of the RGB image. The structure of the local sub-network is shown in FIG. 4, and the cross-modal pedestrian re-identification network of this embodiment is shown in FIG. 5.
The first adaptive graph network has L_1 layers, each layer comprising N_1 nodes, where N_1 is the number of local attention feature blocks of the depth image acquired by the first adaptive pooling module. The edge between the i-th node P_i^l and the j-th node P_j^l in the l-th layer is given by the adjacency matrix A^l(i, j), which is computed from the 2-norm ‖·‖_2 of the node features, where i, j = 1, 2, …, N_1 and l = 1, 2, …, L_1.
The adjacency matrix A^l flexibly represents the relation between any two nodes. In order to embed the information of all graph nodes into the output nodes when updating the local spatial features, so that the output nodes become richer and more discriminative and recognition accuracy is improved, this embodiment adopts the following dynamic embedding:
The i-th node P_i^l of the l-th layer is obtained by fusing the representation of the i-th node of the previous layer with the adjacency-weighted contributions of the other previous-layer nodes, where P_i^0 is the i-th local attention feature block FA_D,i of the depth image, α is a balance parameter that determines the weight of the previous-layer representation of the i-th node relative to the influence of the other nodes during fusion (α is set to 0.1 in this embodiment), and F_l are the parameters of the l-th layer of the first adaptive graph network. Through this adaptive dynamic embedding, P_i^l takes into account the influence of all nodes P_j^{l-1} of the previous layer, so the node features of the output layer contain both the features of the local region and its relations with the other regions, making them richer.
The node P_i^{L_1} of the L_1-th layer is the i-th context-fused local feature FG_D,i of the depth image; the FG_D,i are concatenated to obtain the context-fused local feature FG_D of the depth image.
The second adaptive graph network has L_2 layers, each layer comprising N_2 nodes, where N_2 is the number of local attention feature blocks of the RGB image acquired by the second adaptive pooling module. The edge between the i-th node and the j-th node in the l-th layer is given by an adjacency matrix computed from the 2-norm ‖·‖_2 of the node features, where i, j = 1, 2, …, N_2 with i ≠ j, and l = 1, 2, …, L_2.
The i-th node of the l-th layer is computed in the same way as in the first adaptive graph network, where the 0-th layer node is the i-th local attention feature block FA_V,i of the RGB image, α′ is a balance parameter, and H_l are the parameters of the l-th layer of the second adaptive graph network.
The node of the L_2-th layer is the i-th context-fused local feature FG_V,i of the RGB image; the FG_V,i are concatenated to obtain the context-fused local feature FG_V of the RGB image.
In this embodiment, both L_1 and L_2 are 2.
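The adjacency and node-update formulas appear as images in the original and are not reproduced above, so the following is only one plausible reading: edges derived from the pairwise 2-norm distances between previous-layer node features, and each node updated as an α-balanced combination of its own previous representation and a linearly transformed, adjacency-weighted sum over all previous-layer nodes. The exp(−distance) adjacency and which term α scales are assumptions.

```python
# Hedged sketch of one adaptive-graph-network layer over the N local attention blocks.
import torch
import torch.nn as nn

class AdaptiveGraphLayer(nn.Module):
    def __init__(self, dim: int, alpha: float = 0.1):
        super().__init__()
        self.alpha = alpha
        self.f = nn.Linear(dim, dim)  # per-layer parameters F_l (H_l for the RGB branch)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (B, N, C); the 0-th layer nodes are the local attention feature blocks FA_i.
        dist = torch.cdist(nodes, nodes, p=2)   # pairwise 2-norms between node features
        adj = torch.softmax(-dist, dim=-1)      # adjacency matrix A(i, j) (assumed form)
        neighbor = adj @ self.f(nodes)          # adjacency-weighted influence of all nodes
        # alpha balances the node's own previous representation against the other nodes;
        # which term alpha scales is an assumption.
        return self.alpha * nodes + (1.0 - self.alpha) * neighbor  # nodes of the next layer

# With L_1 = L_2 = 2, two such layers are stacked and the output nodes FG_i are concatenated.
```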
Since the depth image contains only a single channel, prior-art methods that acquire the overall features of pedestrians with a simple global feature extraction network ignore the use of spatial context information. This embodiment uses the adaptive graph convolutional network to model the graph structure between different local spatial regions in the image of each modality, and improves recognition accuracy by acquiring rich spatial context information to strengthen the feature representation capability.
In the loss functions of embodiments 1-3, the logarithms uniformly use base e or base 2.
The recognition performance of the cross-modal pedestrian re-identification networks of embodiments 1-3 above was tested on the public dataset BIWI and compared with existing algorithms; the test results are shown in Table 1.
TABLE 1 Experimental results
Methods Recognition accuracy (%)
ICMDL 7.1
Corr.Dict. 11.3
Cross-modal distillation network 29.2
Example 1 43.7
Example 2 45.5
Example 3 47.1
In Table 1, ICMDL is the method of document 1, Corr.Dict. is the method of document 2, and the cross-modal distillation network is the method of document 3.
As can be seen from Table 1, the average recognition accuracy of the present invention is superior to the current state-of-the-art methods. Moreover, comparing the results of the embodiments without and with the cross-modal relationship network and the adaptive graph network shows that adding these two networks effectively improves the performance of the network.
The cross-modal pedestrian recognition device disclosed by the invention is shown in FIG. 6. It comprises a processor 601 and a storage medium 602, wherein the storage medium 602 is a computer-readable storage medium on which computer instructions are stored; when run, these instructions execute the steps of the cross-modal pedestrian re-identification method disclosed by the invention. The processor 601 loads and executes the instructions and data in the storage medium 602 to implement the above cross-modal pedestrian re-identification method.

Claims (10)

1. An RGB-D-based cross-modal pedestrian re-identification method, comprising a training phase and a recognition phase, characterized in that the training phase comprises the following steps:
S1, constructing a cross-modal pedestrian re-identification network, wherein the pedestrian re-identification network comprises a global sub-network (100) and a local sub-network (200); the global sub-network (100) comprises a depth image global branch (110), an RGB image global branch (120) and an identification branch (130); the local sub-network (200) comprises a depth image local branch (210), an RGB image local branch (220) and a local Triplet loss function calculation module;
the depth image global branch (110) is used for extracting deep features G_D of the depth image and comprises, connected in sequence: a first shallow feature extraction module for extracting shallow features FS_D of the input depth image; a first deep feature extraction module for extracting the deep features G_D of the depth image from FS_D; a first average pooling module for pooling the deep features G_D by average pooling to obtain the global feature FP_D of the depth image; a first batch normalization module for performing batch normalization on the global feature FP_D and obtaining the category feature FB_D of the depth image through a fully connected layer; and a first softmax function calculation module for calculating the softmax loss function L_gs1 of the category feature FB_D according to the real category labels of the training samples;
the RGB image global branch (120) is used for extracting deep features G_V of the RGB image and comprises, connected in sequence: a second shallow feature extraction module for extracting shallow features FS_V of the input RGB image; a second deep feature extraction module for extracting the deep features G_V of the RGB image from FS_V; a second average pooling module for pooling the deep features G_V by average pooling to obtain the global feature FP_V of the RGB image; a second batch normalization module for performing batch normalization on the global feature FP_V and obtaining the category feature FB_V of the RGB image through a fully connected layer; and a second softmax function calculation module for calculating the softmax loss function L_gs2 of the category feature FB_V according to the real category labels of the training samples;
the identification branch (130) is used for obtaining the similarity C_d,v of the depth image and the RGB image according to the deep features G_D of the depth image and the deep features G_V of the RGB image, obtaining the recognition result according to the similarity C_d,v, and computing the global Triplet loss function L_gt according to the global feature FP_D of the depth image and the global feature FP_V of the RGB image;
the depth image local branch (210) is used for extracting the context-fused local features FG_D of the deep features G_D of the depth image and comprises, connected in sequence: a first local feature extraction module for extracting local features FL_D of the deep features G_D of the depth image; a first adaptive pooling module for obtaining the attention features FA_D of the local features FL_D; and a first local loss function calculation module for calculating the local softmax loss function L_ls1 of the depth image;
the RGB image local branch (220) is used for extracting the context-fused local features FG_V of the deep features G_V of the RGB image and comprises, connected in sequence: a second local feature extraction module for extracting local features FL_V of the deep features G_V of the RGB image; a second adaptive pooling module for obtaining the attention features FA_V of the local features FL_V; and a second local loss function calculation module for calculating the local softmax loss function L_ls2 of the RGB image;
the local Triplet loss function calculation module is used for calculating the local Triplet loss function L_lt according to the local attention features FA_D of the depth image and the local attention features FA_V of the RGB image;
S2, training the network constructed in step S1 with a training set, wherein the samples in the training set are depth images labeled with categories and the corresponding RGB images; the training loss function fuses the global loss functions L_gs1, L_gs2 and L_gt with the local loss functions L_ls1, L_ls2 and L_lt, wherein the superscript τ denotes the number of epochs currently trained;
the recognition phase comprises:
S3, inputting the depth image and the RGB image in the cross-modal image pair to be identified into the depth image global branch (110) and the RGB image global branch (120) respectively, obtaining, via the identification branch (130), the similarity C_d,v of the deep features of the depth image and the RGB image, and obtaining the recognition result according to C_d,v.
2. The RGB-D-based cross-modal pedestrian re-identification method of claim 1, wherein the similarity C_d,v in the identification branch (130) is the Euclidean distance between the deep features G_D of the depth image and the deep features G_V of the RGB image, or the cosine similarity between the deep features G_D of the depth image and the deep features G_V of the RGB image.
3. The RGB-D-based cross-modal pedestrian re-identification method of claim 2, wherein the identification branch (130) in the global sub-network further comprises a cross-modal relationship network, a fully connected layer, a similarity correction module and a contrast loss function calculation module;
the cross-modal relationship network is used to obtain the cross-modal relation feature G_D,V of the depth image and the cross-modal relation feature G_V,D of the RGB image from the deep features G_D of the depth image and the deep features G_V of the RGB image;
the cross-modal relationship network comprises six convolutional layers Conv1 to Conv6; the deep features G_D after passing through the first convolutional layer Conv1 and the deep features G_V after passing through the second convolutional layer Conv2 are combined by a dot-product operation to obtain the cross-modal attention weight ω;
the deep features G_V pass through the fourth convolutional layer Conv4 and are multiplied by the cross-modal attention weight ω; the result then passes through the sixth convolutional layer Conv6 and is added to the deep features G_D to obtain the cross-modal relation feature G_D,V of the depth image;
the deep features G_D pass through the third convolutional layer Conv3 and are multiplied by the cross-modal attention weight ω; the result then passes through the fifth convolutional layer Conv5 and is added to the deep features G_V to obtain the cross-modal relation feature G_V,D of the RGB image;
the fully connected layer is used to obtain the relation value S_d,v of the input depth image and RGB image from the cross-modal relation features G_D,V and G_V,D;
the similarity correction module corrects the similarity C_d,v with the relation value S_d,v to obtain the corrected similarity C′_d,v, and the recognition result is obtained according to C′_d,v;
the contrast loss function calculation module calculates the contrast loss function L_C according to the relation value S_d,v;
the training loss function additionally includes the contrast loss L_C, wherein E( ) represents the average value, over epochs 1 to τ−1, of the Triplet loss function L_gt and the softmax loss functions L_gs1 and L_gs2.
4. The RGB-D-based cross-modal pedestrian re-identification method of claim 1, wherein the depth image local branch (210) further comprises a first adaptive graph network for extracting context-fused local features from the attention features FA_D of the depth image;
the first adaptive graph network has L_1 layers, each layer comprising N_1 nodes, N_1 being the number of local attention feature blocks of the depth image acquired by the first adaptive pooling module; the edge between the i-th node P_i^l and the j-th node P_j^l in the l-th layer is given by the adjacency matrix A^l(i, j), computed from the 2-norm ‖·‖_2 of the node features, where i, j = 1, 2, …, N_1 and l = 1, 2, …, L_1;
the i-th node P_i^l of the l-th layer is obtained by fusing the previous-layer representation of the i-th node with the adjacency-weighted contributions of the other previous-layer nodes, wherein P_i^0 is the i-th local attention feature block FA_D,i of the depth image, α is a balance parameter, and F_l are the parameters of the l-th layer of the first adaptive graph network;
the node P_i^{L_1} of the L_1-th layer is the i-th context-fused local feature FG_D,i of the depth image; the FG_D,i are concatenated to obtain the context-fused local feature FG_D of the depth image.
5. The RGB-D-based cross-modal pedestrian re-identification method of claim 1, wherein the RGB image local branch (220) further comprises a second adaptive graph network for extracting context-fused local features from the attention features FA_V of the RGB image;
the second adaptive graph network has L_2 layers, each layer comprising N_2 nodes, N_2 being the number of local attention feature blocks of the RGB image acquired by the second adaptive pooling module; the edge between the i-th node and the j-th node in the l-th layer is given by an adjacency matrix computed from the 2-norm ‖·‖_2 of the node features, where i, j = 1, 2, …, N_2 and l = 1, 2, …, L_2;
the i-th node of the l-th layer is obtained by fusing the previous-layer representation of the i-th node with the adjacency-weighted contributions of the other previous-layer nodes, wherein the 0-th layer node is the i-th local attention feature block FA_V,i of the RGB image, α′ is a balance parameter, and H_l are the parameters of the l-th layer of the second adaptive graph network;
the node of the L_2-th layer is the i-th context-fused local feature FG_V,i of the RGB image; the FG_V,i are concatenated to obtain the context-fused local feature FG_V of the RGB image.
6. The RGB-D-based cross-modal pedestrian re-identification method of claim 1, wherein the first shallow feature extraction module and the second shallow feature extraction module are both composed of the initial layer and block 1 of Resnet-50.
7. The RGB-D-based cross-modal pedestrian re-identification method of claim 1, wherein the first deep feature extraction module and the second deep feature extraction module share weights and are each composed of blocks 2-4 of the Resnet-50 network.
8. The RGB-D-based cross-modal pedestrian re-identification method of claim 1, wherein the first adaptive pooling module and the second adaptive pooling module obtain the local attention feature blocks by downsampling the input deep local feature blocks.
9. A computer-readable storage medium having stored thereon computer instructions which, when run, perform the cross-modal pedestrian re-identification method of any one of claims 1 to 8.
10. A cross-modal pedestrian recognition device, comprising a processor and a storage medium, the storage medium being the computer-readable storage medium of claim 9; the processor loads and executes the instructions and data in the storage medium to implement the cross-modal pedestrian re-identification method of any one of claims 1 to 8.
CN202111148969.XA 2021-09-29 2021-09-29 RGB-D-based cross-modal pedestrian re-identification method, storage medium and device Active CN113887382B (en)


Publications (2)

Publication Number Publication Date
CN113887382A (en) 2022-01-04
CN113887382B (en) 2024-02-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant