CN114495281A - Cross-modal pedestrian re-identification method based on integral and partial constraints - Google Patents

Cross-modal pedestrian re-identification method based on integral and partial constraints

Info

Publication number
CN114495281A
CN114495281A
Authority
CN
China
Prior art keywords
pedestrian
loss
infrared
image
rgb
Prior art date
Legal status: Pending
Application number
CN202210124910.5A
Other languages
Chinese (zh)
Inventor
吕址函 (Lü Zhihan)
朱松豪 (Zhu Songhao)
梁志伟 (Liang Zhiwei)
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202210124910.5A priority Critical patent/CN114495281A/en
Publication of CN114495281A publication Critical patent/CN114495281A/en
Pending legal-status Critical Current

Classifications

    • G06F18/22 Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06F18/24 Pattern recognition; analysing; classification techniques
    • G06F18/253 Pattern recognition; analysing; fusion techniques of extracted features
    • G06N3/08 Computing arrangements based on biological models; neural networks; learning methods


Abstract

According to the cross-modal pedestrian re-identification method based on overall and partial constraints, deep local pedestrian features are extracted from two different modalities by a hybrid cross dual-path feature learning network; the extracted features are then horizontally sliced into p components and mapped to a common space, so that both the local and global features of the image are learned and the representational power of the pedestrian features is improved. Finally, through the joint cooperation of the modality-specific identity loss, the cross-entropy loss, and the proposed loss function, the difference between the modalities is reduced and the overall performance is improved. During training, random horizontal flipping and random erasing are used to augment the training data.

Description

Cross-modal pedestrian re-identification method based on integral and partial constraints
Technical Field
The invention belongs to the technical field of pedestrian re-identification, and particularly relates to a cross-modal pedestrian re-identification method based on integral and partial constraints.
Background
Pedestrian re-identification is a specific pedestrian-retrieval task that uses computer vision techniques to determine whether a particular pedestrian is present in an image or video. In recent years, with the continuous development of society, public safety has attracted increasing attention, and pedestrian re-identification has aroused great research interest. At present, most research processes person images captured by visible-light cameras; however, these methods have many limitations. For example, many criminal events occur at night, when conventional video cameras cannot capture clear images. Thus, these methods are not effective under insufficient lighting.
Most existing research focuses on pedestrian re-identification within the visible modality. Compared with a visible image, an infrared image lacks rich color information, so common pedestrian re-identification methods based on visible-light images are not feasible on infrared images. A search of prior-art documents shows that Wu et al. presented a large-scale cross-modal pedestrian re-identification dataset named SYSU-MM01 and evaluated three common neural network structures: single-stream, dual-stream, and asymmetric fully connected layers; they also proposed deep zero padding for training single-stream networks. Zheng et al. introduced a joint learning framework that couples end-to-end pedestrian re-identification learning with data generation to address the infrared-visible pedestrian re-identification problem. Mang et al. proposed a dynamic dual-attentive aggregation learning framework that keeps model learning from being easily disturbed by noise and becoming unstable. Li et al. added an auxiliary X modality to the network to account for the modality gap. Many existing methods focus on reducing the difference between the infrared and visible modalities, yet recognition accuracy remains less than ideal. These methods solve the cross-modal pedestrian re-identification problem to a certain extent, but shortcomings remain.
Therefore, cross-modal pedestrian re-identification still has the following urgent problems: (1) the difference between the visible and infrared modalities is large, and although existing methods have achieved some success, there is still much room for improvement; (2) cross-modal pedestrian re-identification datasets are scarce, so training data are insufficient. The latter is not only a problem in cross-modal pedestrian re-identification but a common problem in pedestrian re-identification in general: academia lacks large-scale datasets with complex scenes, while industry holds large amounts of data that cannot be released due to privacy concerns.
Disclosure of Invention
In order to solve the above problems, the invention provides a hybrid cross dual-path feature learning network (HCDFL) that extracts deep local pedestrian features from two different modalities. A novel overall constraint function and a partial triplet-center loss function improve the inter-class and intra-class differences from two aspects, across different modalities and within the same modality, better represent the local features of pedestrians, and improve the overall recognition performance. Meanwhile, random horizontal flipping and random erasing are used to augment the training data.
The invention relates to a cross-modal pedestrian re-identification method based on integral and partial constraints, which comprises the following steps:
S1, extracting pedestrian-information features under different modalities from the RGB image and the infrared image of the same scene, using two independent branch networks with identical structure;
S2, uniformly dividing the extracted features into p horizontal components from top to bottom, projecting them to a common space, and outputting a joint representation of modality-specific and modality-shared features;
S3, constructing a multi-loss function comprising the modality-specific identity loss, the cross-entropy loss, and the proposed overall constraint and partial triplet-center loss; mixing and crossing the joint features using the multi-loss function; and reducing the image difference between the infrared and RGB modalities through a modality-distance constraint, so as to obtain the best recognition performance.
Further, the multi-loss function is:

$$L = L_{id}^{V} + L_{id}^{I} + L_{CE} + \lambda L_{WCPTL}$$

where $L_{id}^{V}$ and $L_{id}^{I}$ respectively denote the modality-specific softmax identity losses of the RGB branch and the infrared branch, $L_{CE}$ denotes the cross-entropy loss, and $L_{WCPTL}$ denotes the overall-constraint and partial triplet-center loss function; $\lambda$ is a preset coefficient used to balance the overall loss function.
Further, the overall-constraint process in the loss function comprises two steps: first, the distance between different pedestrians within the same modality is enlarged, while the distance between the same pedestrian's samples in the RGB and infrared modalities is reduced; then, the distance between the same pedestrian's samples in the two modalities is reduced further, which improves the similarity of pedestrian identity and reduces the differences between samples within the modalities. Given the deep features of pedestrians in the different modalities $\{v_i^{p}, t_i^{q}\}$, where $1 \le i \le N$, $v_i$ and $t_i$ respectively denote the $i$-th pedestrian identity in the RGB and infrared modalities, and $v_i^{p}$ and $t_i^{q}$ respectively denote the $p$-th and $q$-th samples of the $i$-th pedestrian identity in the RGB and infrared modalities, the overall constraint $L_W$ is formulated as:
$$L_W = \sum_{i=1}^{N}\sum_{\substack{j=1 \\ y_j \neq y_i}}^{N}\Big[\alpha + D\big(v_i^{p}, t_i^{q}\big) - D\big(v_i^{p}, v_j^{p}\big)\Big]_{+} + \sum_{i=1}^{N} D\big(v_i^{p}, t_i^{q}\big)$$
The partial triplet-center loss is formulated as follows:
$$L_P = \sum_{i=1}^{N}\Big[D\big(x_i, c^{2}_{y_i}\big) + \alpha - \min_{j \neq y_i} D\big(x_i, c^{2}_{j}\big)\Big]_{+} + \sum_{i=1}^{N}\Big[D\big(z_i, c^{1}_{y_i}\big) + \alpha - \min_{j \neq y_i} D\big(z_i, c^{1}_{j}\big)\Big]_{+}$$
where $L_P$ denotes the partial triplet-center loss; $x_i$ and $z_i$ respectively denote the RGB and infrared image features; $c^{1}_{y_i}$ and $c^{2}_{y_i}$ respectively denote the center of class $y_i$ in the RGB and infrared modalities; $y_i$ denotes the identity label of the $i$-th sample; $\alpha$ denotes a margin; $N$ denotes the batch size; $D(\cdot,\cdot)$ denotes the Euclidean distance; and $[x]_{+} = \max(0, x)$;
In summary, the overall-constraint and partial triplet-center loss function can be expressed as:

$$L_{WCPTL} = L_W + L_P$$
further, modality specific identity loss: because the pedestrian characteristics in the RGB image and the infrared image are very different, different networks are used to obtain the characteristic representation in different modalities, and the Softmax loss is used to predict the pedestrian identity in each modality, and the formula can be expressed as follows:
Figure BDA0003499988860000033
Figure BDA0003499988860000034
in the formula
Figure BDA0003499988860000035
And
Figure BDA0003499988860000036
respectively represent belonging to
Figure BDA0003499988860000037
And
Figure BDA0003499988860000038
the ith RGB image feature and the infrared image feature of the class,
Figure BDA0003499988860000039
and
Figure BDA00034999888600000310
respectively represent the weight W in the last full connection layerVAnd WIJ (th) column of (b)VAnd bIRespectively representing RGB and infrared modal bias, M representing head of the line, NVAnd NIRespectively representing the number of RGB image and infrared image training samples in the same batch,
Figure BDA00034999888600000311
and
Figure BDA00034999888600000312
respectively representing the loss of identity functions of the RGB image and the infrared image.
Further, in order to make the feature representations of the same pedestrian similar across the different modalities, the following cross-entropy loss function is introduced:

$$L_{CE} = -\sum_{i=1}^{N}\sum_{k=1}^{p}\log\hat{p}\big(y_i \mid f_i^{k}\big)$$

where $y_i$ denotes the true label of the $i$-th input image and $f_i^{k}$ denotes its $k$-th part feature; that is, the $p$ part features of each input image share the label information of that image.
The invention has the following beneficial effects. The invention discloses a cross-modal pedestrian re-identification method based on overall and partial constraints and proposes a hybrid cross dual-path feature learning network with a modality-shared parameter layer and modality-specific parameter layers for extracting features from pedestrian images of different modalities. Second, the network horizontally slices the pedestrian features, so that the local and global features of the image are better learned and the representational power of the pedestrian features is improved. Meanwhile, at the feature-embedding layer, the network cross-combines the features into several different batch combinations, which benefits feature matching and the modality-distance constraint. When designing the loss function, the consistency constraint on intra-class feature distributions of different modal data and the inter-class correlation constraint are fully considered, and a novel overall constraint and a partial triplet-center loss function are proposed to reduce modality differences and make samples of the same class closer to their class center and farther from other class centers. During training, random horizontal flipping and random erasing are used to augment the training data.
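As a concrete illustration of the data augmentation named above, the following is a minimal torchvision sketch; the input size, probabilities, and erasing parameters are assumed values, not ones specified by the invention.

```python
# Hypothetical training-time augmentation pipeline: random horizontal flipping
# and random erasing, as named above. All numeric values here are assumptions.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((288, 144)),            # assumed person re-id input size
    transforms.RandomHorizontalFlip(p=0.5),   # random horizontal flipping
    transforms.ToTensor(),                    # RandomErasing expects a tensor input
    transforms.RandomErasing(p=0.5),          # random erasing
])
```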
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is an end-to-end block diagram illustration of cross-modal pedestrian re-identification based on global constraints and partial triplet-center loss in accordance with the present invention;
FIG. 3 is a schematic diagram of combinations of the triplet loss, the center loss, and the softmax loss;
FIG. 4 is a schematic diagram of the overall constraint of the present invention;
FIG. 5 is a schematic of the partial-triplet center loss of the present invention;
FIG. 6 is a diagram illustrating the recognition effect of the present invention on the SYSU-MM01 and RegDB data sets.
Detailed Description
In order that the present invention may be more readily and clearly understood, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments that are illustrated in the appended drawings.
With reference to fig. 1 and fig. 2, the method provided by the present invention first extracts pedestrian information in the different modalities using an RGB branch and an infrared branch whose backbone networks are ResNet50, and uniformly divides the extracted features into p horizontal components from top to bottom using an average pooling layer; then the horizontally sliced features are projected to a common space, and a joint representation of modality-specific and modality-shared features is output; finally, the joint features are mixed and crossed using the modality-specific identity loss, the cross-entropy loss, and the proposed overall constraint and partial triplet-center loss, and the best recognition performance is obtained through the modality-distance constraint.
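The pipeline just described can be sketched in PyTorch as follows. This is a minimal illustration assuming torchvision's ResNet-50 and p = 6 stripes; it duplicates the whole trunk per modality rather than reproducing the patent's exact split between modality-specific and shared layers.

```python
# Minimal sketch of the dual-path extractor: two ResNet-50 trunks (one per
# modality) followed by average pooling into p horizontal stripes.
import torch
import torch.nn as nn
import torchvision.models as models

class DualPathExtractor(nn.Module):
    def __init__(self, num_parts: int = 6):
        super().__init__()
        def trunk():
            resnet = models.resnet50(weights=None)
            # keep the convolutional trunk; drop avgpool and the classifier
            return nn.Sequential(*list(resnet.children())[:-2])
        self.rgb_branch = trunk()
        self.ir_branch = trunk()
        # uniformly divide the feature map into p horizontal stripes
        self.part_pool = nn.AdaptiveAvgPool2d((num_parts, 1))

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor):
        f_rgb = self.part_pool(self.rgb_branch(rgb))  # (B, 2048, p, 1)
        f_ir = self.part_pool(self.ir_branch(ir))     # (B, 2048, p, 1)
        return f_rgb, f_ir
```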
The overall constraint and partial triplet-center loss proposed by the invention first constrain the distances between the modalities as a whole, reducing the difference between the RGB and infrared modalities; second, by combining the triplet loss and the center loss, the loss function learns separate centers for the RGB and infrared modalities, so that samples of the same class move closer to their class center and away from other class centers, improving the class separability within each modality.
1. Hybrid cross dual path feature learning network
In visible single-modality pedestrian re-identification, a common approach is to horizontally slice the pedestrian image, extract local features, and then perform feature matching. In infrared-visible pedestrian re-identification, however, an image captured by an infrared camera differs greatly from a visible image: the infrared image retains only inherent cues such as the overall appearance and posture of the pedestrian, while losing important information such as color and illumination. Therefore, the conventional method cannot be directly used to solve the infrared-visible pedestrian re-identification problem.
As shown in fig. 2, a dual-stream structure is used as the basic structure, mainly because a single-stream structure uses a common feature-extraction network that cannot accurately extract the features of both RGB and infrared images; in addition, a single-stream structure shares global parameters, so the local characteristics of pedestrians are largely ignored. In the dual-stream structure, the shallow network parameters are specific to each modality while the deep network parameters are shared, so both local and global characteristics are taken into account and recognition performance is improved. Therefore, the invention adopts the traditional two-path local-feature network, which consists of a feature extractor and a feature-embedding part.
The infrared-visible pedestrian re-identification dataset may be represented as D = {V, I}, where V denotes the RGB images and I the infrared images.
In the feature-extraction stage, the backbone network ResNet50 extracts features for each branch, yielding the corresponding pedestrian features; the final average-pooling layer and the structure after it are removed so as to enlarge the receptive field and enrich the feature granularity. In particular, the two branches use the same network structure; this design lets the high-level feature output express high-level semantics better and makes the identity-discrimination ability of the features stronger.
In the feature-embedding stage, the pedestrian features are first horizontally divided into p equal components to learn a low-dimensional embedding space between the two heterogeneous modalities; then a global pooling layer is applied to each part, yielding p 2048-dimensional features. To further reduce the feature dimension, a 1×1 convolutional layer performs dimension reduction on each 2048-dimensional part feature, finally producing 256-dimensional feature representations.
Meanwhile, in order to avoid vanishing gradients and reduce internal covariate shift, a batch-normalization layer is added after each fully connected layer; finally, a shared layer serves as a projection function that projects the features of the two modalities into a common embedding space to close the gap between them.
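A sketch of this embedding stage is given below, assuming the stripes arrive as (B, 2048, p, 1) maps; whether the 1×1 convolution weights are shared across stripes is not stated in the text and is an assumption here.

```python
# Per-part embedding: a 1x1 convolution reduces each 2048-dim stripe to 256
# dims, followed by batch normalization to mitigate internal covariate shift.
import torch
import torch.nn as nn

class PartEmbedding(nn.Module):
    def __init__(self, in_dim: int = 2048, out_dim: int = 256):
        super().__init__()
        self.reduce = nn.Conv2d(in_dim, out_dim, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_dim)

    def forward(self, parts: torch.Tensor) -> torch.Tensor:
        # parts: (B, 2048, p, 1) -> (B, 256, p) embedded stripes
        return self.bn(self.reduce(parts)).squeeze(-1)
```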
In the training stage, the network model is trained with the combination of the modality-specific identity loss, the cross-entropy loss, and the proposed overall constraint and partial triplet-center loss to improve recognition accuracy. Using mixed cross training, the joint representation features of the RGB and infrared branches are divided into three groups, handled by the partial constraint, the overall constraint, and the cross-entropy loss respectively; the partial and overall constraints together form the proposed overall-constraint and partial triplet-center loss function. In the testing stage, the features of the query image and the gallery images are extracted separately and then concatenated into high-dimensional features to form the final feature descriptors.
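At test time the part features can simply be concatenated into one descriptor, as in this minimal sketch (the layout of the part tensor is an assumption carried over from the embedding sketch above):

```python
# Test-time descriptor: concatenate the p part embeddings of an image into a
# single high-dimensional feature vector used for matching.
import torch

def final_descriptor(part_feats: torch.Tensor) -> torch.Tensor:
    # part_feats: (B, 256, p) -> (B, 256 * p)
    return part_feats.flatten(1)
```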
A traditional two-path feature learning network extracts pedestrian features through the backbone, fuses them through a weight-sharing module, and outputs them directly, attempting to learn cross-modal information straight from the two original modalities. Related experimental results show that such methods are not sufficient to narrow the gap between the two modalities. The network proposed by the invention instead cross-combines the pedestrian features into several different batch combinations and lets multiple loss functions cooperate. This feature cross-combination helps balance the model's ability to learn representations of both the specific and the shared characteristics of the different modal data, effectively improving the matching ability across multimodal data.
The modality-specific loss function directly utilizes modality information and preserves the most original pedestrian characteristics; the cross-entropy loss is used for pedestrian identity recognition, with RGB and infrared modality features extracted to form one batch; within the same batch, the RGB and infrared image features are consistent, so pairs of batches are constructed for the partial constraint and the overall constraint respectively.
A multi-loss function is constructed through joint cooperation as shown in formula (1); it comprises the modality-specific identity loss, the cross-entropy loss, the overall-constraint loss, and the partial triplet-center loss. The overall loss function of the proposed framework can be expressed as:

$$L = L_{id}^{V} + L_{id}^{I} + L_{CE} + \lambda L_{WCPTL} \tag{1}$$

where $L_{id}^{V}$ and $L_{id}^{I}$ respectively denote the modality-specific softmax identity losses of the RGB branch and the infrared branch, $L_{CE}$ denotes the cross-entropy loss, and $L_{WCPTL}$ denotes the overall-constraint and partial triplet-center loss function; $\lambda$ is a preset coefficient used to balance the overall loss function.
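Formula (1) reduces to a weighted sum, as in the sketch below; the value of λ is not specified in the text and is a placeholder here.

```python
# Joint objective of formula (1): identity losses of both branches, cross-entropy
# loss, and the overall-constraint and partial triplet-center loss scaled by lambda.
def total_loss(l_id_v, l_id_i, l_ce, l_wcptl, lam: float = 1.0):
    return l_id_v + l_id_i + l_ce + lam * l_wcptl
```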
Modality-specific identity loss: because the pedestrian features in the RGB and infrared images differ greatly, different networks are used to obtain the feature representations in the two modalities. A softmax loss is used to predict the pedestrian identity in each modality:

$$L_{id}^{V} = -\frac{1}{N_V}\sum_{i=1}^{N_V}\log\frac{\exp\big((W_j^{V})^{\top}x_i^{V}+b^{V}\big)}{\sum_{k=1}^{M}\exp\big((W_k^{V})^{\top}x_i^{V}+b^{V}\big)}, \qquad
L_{id}^{I} = -\frac{1}{N_I}\sum_{i=1}^{N_I}\log\frac{\exp\big((W_j^{I})^{\top}x_i^{I}+b^{I}\big)}{\sum_{k=1}^{M}\exp\big((W_k^{I})^{\top}x_i^{I}+b^{I}\big)} \tag{2}$$

where $x_i^{V}$ and $x_i^{I}$ respectively denote the $i$-th RGB image feature and infrared image feature belonging to the $j$-th class, $W_j^{V}$ and $W_j^{I}$ respectively denote the $j$-th column of the weights $W^{V}$ and $W^{I}$ of the last fully connected layer, $b^{V}$ and $b^{I}$ respectively denote the biases of the RGB and infrared modalities, $M$ denotes the number of pedestrian identity classes, and $N_V$ and $N_I$ respectively denote the numbers of RGB and infrared training samples in the same batch.
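In practice this amounts to one softmax classifier per branch trained with cross-entropy, as sketched below; the feature dimension and class count are assumed values.

```python
# Modality-specific identity losses: separate classifiers (W^V, b^V) and
# (W^I, b^I) on the RGB and infrared features, each with softmax cross-entropy.
import torch.nn as nn

class ModalitySpecificID(nn.Module):
    def __init__(self, feat_dim: int = 256, num_classes: int = 395):
        super().__init__()
        self.fc_v = nn.Linear(feat_dim, num_classes)  # W^V, b^V
        self.fc_i = nn.Linear(feat_dim, num_classes)  # W^I, b^I
        self.ce = nn.CrossEntropyLoss()

    def forward(self, feat_v, feat_i, labels_v, labels_i):
        l_id_v = self.ce(self.fc_v(feat_v), labels_v)
        l_id_i = self.ce(self.fc_i(feat_i), labels_i)
        return l_id_v, l_id_i
```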
Cross entropy loss: in order to make the feature characterization of the same pedestrian have similarity under different modes, a cross entropy loss function shown as follows is introduced:
Figure BDA0003499988860000071
i.e. p part features of each input image share the label information of the image.
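A minimal sketch of this part-level supervision follows, assuming each part has its own classifier head and all p parts reuse the image's label:

```python
# Part-level cross-entropy: every one of the p part features is classified
# independently, and all parts share the identity label of the image.
import torch.nn as nn

def part_ce_loss(part_logits, labels):
    # part_logits: list of p tensors of shape (B, num_classes); labels: (B,)
    ce = nn.CrossEntropyLoss()
    return sum(ce(logits, labels) for logits in part_logits)
```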
2. Overall constraint and partial triplet-center loss
The invention provides a novel overall constraint and partial triplet-center loss; this function improves the inter-class and intra-class differences from two aspects, across different modalities and within the same modality, and improves the overall recognition performance.
The triplet loss function is often applied in fields such as face recognition and pedestrian re-identification; it both shortens the intra-class distance and enlarges the inter-class distance. For the infrared-visible pedestrian re-identification task, pedestrian images exhibit inter-class distances within the same modality as well as across different modalities.
The triplet loss function is formulated as follows:

$$L_{tri} = \sum_{i=1}^{N}\Big[D\big(f(x_i^{a}), f(x_i^{p})\big) - D\big(f(x_i^{a}), f(x_i^{n})\big) + \alpha\Big]_{+} \tag{4}$$

where $f(x_i^{a})$, $f(x_i^{p})$, and $f(x_i^{n})$ respectively denote the feature representations of the anchor image, the positive-sample image, and the negative-sample image; the identity information of $x_i^{a}$ and $x_i^{p}$ is the same, while that of $x_i^{a}$ and $x_i^{n}$ is different; $\alpha$ denotes a margin, $N$ denotes the batch size, $D(\cdot,\cdot)$ denotes the Euclidean distance, and $[x]_{+} = \max(0, x)$.
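Equation (4) translates directly into code; PyTorch's built-in nn.TripletMarginLoss computes essentially the same quantity. The margin value below is an assumption.

```python
# Triplet loss of equation (4): pull anchor-positive pairs together and push
# anchor-negative pairs apart by at least the margin alpha.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, alpha: float = 0.3):
    d_ap = F.pairwise_distance(anchor, positive)  # D(f(x^a), f(x^p))
    d_an = F.pairwise_distance(anchor, negative)  # D(f(x^a), f(x^n))
    return torch.clamp(d_ap - d_an + alpha, min=0).sum()
```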
As can be seen from fig. 3(a), although combining the two loss functions achieves a good effect, the data distribution is not uniform and model performance is not stable. The center loss was first applied in the field of face recognition; it constrains the distance between a sample and the center of its class, learning one center for each class. The center loss function is formulated as follows:

$$L_{C} = \frac{1}{2}\sum_{i=1}^{m}\big\|x_i - c_{y_i}\big\|_2^{2} \tag{5}$$

where $x_i$ is the feature representation, $y_i$ the class corresponding to $x_i$, $c_{y_i}$ the center of class $y_i$, $m$ the mini-batch size, and $\|\cdot\|_2$ the Euclidean distance.
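Equation (5) can be sketched as follows, with the class centers held as trainable parameters:

```python
# Center loss of equation (5): each sample is pulled toward the learned center
# of its own class; centers are updated by gradient descent here.
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        return 0.5 * (feats - self.centers[labels]).pow(2).sum(dim=1).sum()
```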
As can be seen in connection with fig. 3(b), the key for the overall constraint to learn features across modalities is to narrow the cross-modal difference. Owing to drastic visual changes, the cross-modal difference can be large, which greatly reduces pedestrian re-identification performance; the cross-modal difference therefore needs to be reduced as a whole.
With reference to fig. 4, the overall-constraint process in the proposed loss function comprises two steps: first, the distance between different pedestrians within the same modality is enlarged, while the distance between the same pedestrian's samples in the RGB and infrared modalities is reduced; then, the distance between the same pedestrian's samples in the two modalities is reduced further, which improves the similarity of pedestrian identity and reduces the differences between samples within the modalities. Given the deep features of pedestrians in the different modalities $\{v_i^{p}, t_i^{q}\}$, where $1 \le i \le N$, $v_i$ and $t_i$ respectively denote the $i$-th pedestrian identity in the RGB and infrared modalities, and $v_i^{p}$ and $t_i^{q}$ respectively denote the $p$-th and $q$-th samples of the $i$-th pedestrian identity in the RGB and infrared modalities, the specific formula is as follows:
$$L_W = \sum_{i=1}^{N}\sum_{\substack{j=1 \\ y_j \neq y_i}}^{N}\Big[\alpha + D\big(v_i^{p}, t_i^{q}\big) - D\big(v_i^{p}, v_j^{p}\big)\Big]_{+} + \sum_{i=1}^{N} D\big(v_i^{p}, t_i^{q}\big) \tag{6}$$
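The following is a speculative sketch of the two-step overall constraint as reconstructed in equation (6); it assumes each row i of the RGB and infrared feature batches belongs to the same pedestrian, and it is an interpretation of the prose rather than the patent's exact formula.

```python
# Overall constraint (interpretation): step 1 enforces, with margin alpha, that
# a pedestrian's cross-modal distance be smaller than its distance to any other
# pedestrian in the same modality; step 2 keeps shrinking the cross-modal distance.
import torch

def overall_constraint(v, t, labels, alpha: float = 0.3):
    # v, t: (N, d) RGB / infrared features aligned by identity; labels: (N,)
    d_cross = (v - t).norm(dim=1)                      # same-identity cross-modal distance
    d_intra = torch.cdist(v, v)                        # pairwise RGB distances
    diff = labels.unsqueeze(0) != labels.unsqueeze(1)  # different-identity mask
    inf = torch.full_like(d_intra, float("inf"))
    d_neg = torch.where(diff, d_intra, inf).min(dim=1).values  # hardest same-modality negative
    step1 = torch.clamp(d_cross + alpha - d_neg, min=0).sum()
    step2 = d_cross.sum()
    return step1 + step2
```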
with reference to fig. 5, by combining two loss functions, samples in two modalities can be considered at the same time, which is beneficial to reduce intra-class differences in the same modality, reduce differences between modalities, and improve recognition accuracy.
The partial triplet-center loss is formulated as follows:

$$L_P = \sum_{i=1}^{N}\Big[D\big(x_i, c^{2}_{y_i}\big) + \alpha - \min_{j \neq y_i} D\big(x_i, c^{2}_{j}\big)\Big]_{+} + \sum_{i=1}^{N}\Big[D\big(z_i, c^{1}_{y_i}\big) + \alpha - \min_{j \neq y_i} D\big(z_i, c^{1}_{j}\big)\Big]_{+} \tag{7}$$
where $x_i$ and $z_i$ respectively denote the RGB and infrared image features, $c^{1}_{y_i}$ and $c^{2}_{y_i}$ respectively denote the center of class $y_i$ in the RGB and infrared modalities, $y_i$ denotes the identity label of the $i$-th sample, $\alpha$ denotes a margin, $N$ denotes the batch size, $D(\cdot,\cdot)$ denotes the Euclidean distance, and $[x]_{+} = \max(0, x)$.
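A sketch of the partial triplet-center loss as reconstructed in equation (7) follows; pairing each modality's features with the opposite modality's centers is an assumption made here.

```python
# Partial triplet-center loss (interpretation): pull each feature toward the
# same-identity center of the opposite modality and push it, with margin alpha,
# away from the nearest other-class center.
import torch

def center_triplet_term(feats, centers, labels, alpha: float = 0.3):
    d = torch.cdist(feats, centers)                      # (N, C) feature-to-center distances
    d_pos = d.gather(1, labels.unsqueeze(1)).squeeze(1)  # D(f_i, c_{y_i})
    d_other = d.scatter(1, labels.unsqueeze(1), float("inf"))
    d_neg = d_other.min(dim=1).values                    # nearest other-class center
    return torch.clamp(d_pos + alpha - d_neg, min=0).sum()

def partial_triplet_center_loss(x, z, c1, c2, labels, alpha: float = 0.3):
    # x: RGB features, z: infrared features; c1 / c2: RGB / infrared class centers
    return center_triplet_term(x, c2, labels, alpha) + center_triplet_term(z, c1, labels, alpha)
```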
In summary, the overall constraint and partial triplet-center loss function can be expressed as:
$$L_{WCPTL} = L_W + L_P \tag{8}$$
the result shown in fig. 6 is a schematic diagram of the recognition effect of the method of the present invention on the SYSU-MM01 and RegDB data sets, and it can be seen from the diagram that the recognition effect of the method is ideal and the recognition accuracy is high.
In summary, the present invention addresses the problems in cross-modal pedestrian re-identification. On one hand, it proposes a hybrid cross dual-path feature learning network (HCDFL) that extracts deep local pedestrian features from two different modalities: the network first extracts pedestrian features in the different modalities, then horizontally slices the extracted features into p components and maps them to a common space, so that both the local and global features of the image are learned and the representational power of the pedestrian features is improved; finally, overall performance is improved through the joint cooperation of the modality-specific identity loss, the cross-entropy loss, and the proposed loss function. On the other hand, the invention proposes a novel overall constraint and partial triplet-center loss that improves inter-class and intra-class differences from the two aspects of different modalities and the same modality, better representing the local features of pedestrians and improving overall recognition performance. The proposed loss function first uses the overall constraint to reduce the difference between the modalities; then, by fusing the triplet loss and the center loss, it enlarges the differences between classes within the same modality, so that samples of the same class move closer to their own center and away from other class centers. In addition, because the amount of image data is limited, random horizontal flipping and random erasing are used to augment the training data during training. Experimental results on the two common datasets SYSU-MM01 and RegDB show that the proposed method achieves excellent recognition performance.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and all equivalent variations made by using the contents of the present specification and the drawings are within the protection scope of the present invention.

Claims (5)

1. A cross-modal pedestrian re-identification method based on integral and partial constraints is characterized by comprising the following steps:
S1, extracting pedestrian-information features under different modalities from the RGB image and the infrared image of the same scene, using two independent branch networks with identical structure;
S2, uniformly dividing the extracted features into p horizontal components from top to bottom, projecting them to a common space, and outputting a joint representation of modality-specific and modality-shared features;
S3, constructing a multi-loss function, mixing and crossing the joint features using the multi-loss function, and reducing the image difference between the infrared and RGB modalities through a modality-distance constraint, so as to obtain the best recognition performance.
2. The method according to claim 1, characterized in that the multi-loss function is:

$$L = L_{id}^{V} + L_{id}^{I} + L_{CE} + \lambda L_{WCPTL}$$

where $L_{id}^{V}$ and $L_{id}^{I}$ respectively denote the modality-specific softmax identity losses of the RGB branch and the infrared branch, $L_{CE}$ denotes the cross-entropy loss, and $L_{WCPTL}$ denotes the overall-constraint and partial triplet-center loss function; $\lambda$ is a preset coefficient used to balance the overall loss function.
3. The method according to claim 2, characterized in that the overall-constraint process comprises two steps: first, the distance between different pedestrians within the same modality is enlarged, while the distance between the same pedestrian's samples in the RGB and infrared modalities is reduced; then, the distance between the same pedestrian's samples in the two modalities is reduced further, which improves the similarity of pedestrian identity and reduces the differences between samples within the modalities; given the deep features of pedestrians in the different modalities $\{v_i^{p}, t_i^{q}\}$, where $1 \le i \le N$, $v_i$ and $t_i$ respectively denote the $i$-th pedestrian identity in the RGB and infrared modalities, and $v_i^{p}$ and $t_i^{q}$ respectively denote the $p$-th and $q$-th samples of the $i$-th pedestrian identity in the RGB and infrared modalities, the overall constraint $L_W$ is formulated as:

$$L_W = \sum_{i=1}^{N}\sum_{\substack{j=1 \\ y_j \neq y_i}}^{N}\Big[\alpha + D\big(v_i^{p}, t_i^{q}\big) - D\big(v_i^{p}, v_j^{p}\big)\Big]_{+} + \sum_{i=1}^{N} D\big(v_i^{p}, t_i^{q}\big)$$

the partial triplet-center loss is formulated as follows:

$$L_P = \sum_{i=1}^{N}\Big[D\big(x_i, c^{2}_{y_i}\big) + \alpha - \min_{j \neq y_i} D\big(x_i, c^{2}_{j}\big)\Big]_{+} + \sum_{i=1}^{N}\Big[D\big(z_i, c^{1}_{y_i}\big) + \alpha - \min_{j \neq y_i} D\big(z_i, c^{1}_{j}\big)\Big]_{+}$$

where $L_P$ denotes the partial triplet-center loss; $x_i$ and $z_i$ respectively denote the RGB and infrared image features; $c^{1}_{y_i}$ and $c^{2}_{y_i}$ respectively denote the center of class $y_i$ in the RGB and infrared modalities; $y_i$ denotes the identity label of the $i$-th sample; $\alpha$ denotes a margin; $N$ denotes the batch size; $D(\cdot,\cdot)$ denotes the Euclidean distance; and $[x]_{+} = \max(0, x)$;

in summary, the overall-constraint and partial triplet-center loss function can be expressed as:

$$L_{WCPTL} = L_W + L_P$$
4. The method according to claim 2, characterized in that the modality-specific identity loss is constructed as follows: because the pedestrian features in the RGB and infrared images differ greatly, different networks are used to obtain the feature representations in the two modalities, and a softmax loss is used to predict the pedestrian identity in each modality; the formulas can be expressed as follows:

$$L_{id}^{V} = -\frac{1}{N_V}\sum_{i=1}^{N_V}\log\frac{\exp\big((W_j^{V})^{\top}x_i^{V}+b^{V}\big)}{\sum_{k=1}^{M}\exp\big((W_k^{V})^{\top}x_i^{V}+b^{V}\big)}$$

$$L_{id}^{I} = -\frac{1}{N_I}\sum_{i=1}^{N_I}\log\frac{\exp\big((W_j^{I})^{\top}x_i^{I}+b^{I}\big)}{\sum_{k=1}^{M}\exp\big((W_k^{I})^{\top}x_i^{I}+b^{I}\big)}$$

in the formulas, $x_i^{V}$ and $x_i^{I}$ respectively denote the $i$-th RGB image feature and infrared image feature belonging to the $j$-th class; $W_j^{V}$ and $W_j^{I}$ respectively denote the $j$-th column of the weights $W^{V}$ and $W^{I}$ of the last fully connected layer; $b^{V}$ and $b^{I}$ respectively denote the biases of the RGB and infrared modalities; $M$ denotes the number of pedestrian identity classes; $N_V$ and $N_I$ respectively denote the numbers of RGB and infrared training samples in the same batch; and $L_{id}^{V}$ and $L_{id}^{I}$ respectively denote the identity loss functions of the RGB and infrared images.
5. The method according to claim 2, characterized in that the cross-entropy loss function is:

$$L_{CE} = -\sum_{i=1}^{N}\sum_{k=1}^{p}\log\hat{p}\big(y_i \mid f_i^{k}\big)$$

where $y_i$ denotes the true label of the $i$-th input image and $f_i^{k}$ denotes its $k$-th part feature; that is, the $p$ part features of each input image share the label information of that image.
CN202210124910.5A 2022-02-10 2022-02-10 Cross-modal pedestrian re-identification method based on integral and partial constraints Pending CN114495281A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210124910.5A CN114495281A (en) 2022-02-10 2022-02-10 Cross-modal pedestrian re-identification method based on integral and partial constraints

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210124910.5A CN114495281A (en) 2022-02-10 2022-02-10 Cross-modal pedestrian re-identification method based on integral and partial constraints

Publications (1)

Publication Number Publication Date
CN114495281A true CN114495281A (en) 2022-05-13

Family

ID=81477698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210124910.5A Pending CN114495281A (en) 2022-02-10 2022-02-10 Cross-modal pedestrian re-identification method based on integral and partial constraints

Country Status (1)

Country Link
CN (1) CN114495281A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN111597876A (en) * 2020-04-01 2020-08-28 浙江工业大学 Cross-modal pedestrian re-identification method based on difficult quintuple
US20200285896A1 (en) * 2019-03-09 2020-09-10 Tongji University Method for person re-identification based on deep model with multi-loss fusion training strategy
CN113569639A (en) * 2021-06-25 2021-10-29 湖南大学 Cross-modal pedestrian re-identification method based on sample center loss function


Similar Documents

Publication Publication Date Title
Jiang et al. CmSalGAN: RGB-D salient object detection with cross-view generative adversarial networks
Liu et al. Enhancing the discriminative feature learning for visible-thermal cross-modality person re-identification
Agnese et al. A survey and taxonomy of adversarial neural networks for text‐to‐image synthesis
CN108537136A (en) The pedestrian's recognition methods again generated based on posture normalized image
CN110580302B (en) Sketch image retrieval method based on semi-heterogeneous joint embedded network
CN111160264B (en) Cartoon character identity recognition method based on generation countermeasure network
CN106960182B (en) A kind of pedestrian's recognition methods again integrated based on multiple features
CN104504362A (en) Face detection method based on convolutional neural network
CN110598018B (en) Sketch image retrieval method based on cooperative attention
CN111539255A (en) Cross-modal pedestrian re-identification method based on multi-modal image style conversion
CN111832511A (en) Unsupervised pedestrian re-identification method for enhancing sample data
CN109492528A (en) A kind of recognition methods again of the pedestrian based on gaussian sum depth characteristic
CN114662497A (en) False news detection method based on cooperative neural network
CN113361474B (en) Double-current network image counterfeiting detection method and system based on image block feature extraction
CN112001279A (en) Cross-modal pedestrian re-identification method based on dual attribute information
CN115690669A (en) Cross-modal re-identification method based on feature separation and causal comparison loss
CN108986103A (en) A kind of image partition method merged based on super-pixel and more hypergraphs
CN113722528B (en) Method and system for rapidly retrieving photos for sketch
CN116778530A (en) Cross-appearance pedestrian re-identification detection method based on generation model
CN114495281A (en) Cross-modal pedestrian re-identification method based on integral and partial constraints
Gong et al. Person re-identification based on two-stream network with attention and pose features
Li et al. AR-CNN: an attention ranking network for learning urban perception
Yang et al. SSRR: Structural Semantic Representation Reconstruction for Visible-Infrared Person Re-Identification
CN114519897A (en) Human face in-vivo detection method based on color space fusion and recurrent neural network
Zhang et al. Triplet interactive attention network for cross-modality person re-identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination