CN114882521A - Unsupervised pedestrian re-identification method and unsupervised pedestrian re-identification device based on multi-branch network

Unsupervised pedestrian re-identification method and unsupervised pedestrian re-identification device based on multi-branch network

Info

Publication number
CN114882521A
Authority
CN
China
Prior art keywords
training, image, network, feature, model
Prior art date
Legal status
Pending
Application number
CN202210333714.9A
Other languages
Chinese (zh)
Inventor
朱成博 (Zhu Chengbo)
曲寒冰 (Qu Hanbing)
王鑫轩 (Wang Xinxuan)
李国鑫 (Li Guoxin)
阎刚 (Yan Gang)
Current Assignee
Beijing New Technology Application Research Institute Co ltd
Hebei University of Technology
Original Assignee
Beijing New Technology Application Research Institute Co ltd
Hebei University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing New Technology Application Research Institute Co ltd, Hebei University of Technology filed Critical Beijing New Technology Application Research Institute Co ltd
Priority to CN202210333714.9A
Publication of CN114882521A

Classifications

    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06N 3/045 Combinations of networks
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G06V 10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/764 Image or video recognition or understanding using classification, e.g. of video objects
    • G06V 10/7753 Incorporation of unlabelled data, e.g. multiple instance learning [MIL]
    • G06V 10/82 Image or video recognition or understanding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides an unsupervised pedestrian re-identification training method based on a multi-branch network, which includes: in the source domain training stage, inputting labeled images into a pre-training model for training to obtain a trained pre-training model; and in the target domain training stage, inputting unlabeled images into a target domain training model to obtain a pedestrian re-identification model after training. The disclosure also provides an unsupervised pedestrian re-identification training device based on the multi-branch network, as well as a multi-branch-network-based unsupervised pedestrian re-identification method and device, an electronic device, and a readable storage medium.

Description

Unsupervised pedestrian re-identification method and unsupervised pedestrian re-identification device based on multi-branch network
Technical Field
The present disclosure relates to the technical field of pedestrian re-identification, and in particular, to an unsupervised pedestrian re-identification training method and apparatus based on a multi-branch network, and further relates to an unsupervised pedestrian re-identification method and apparatus based on a multi-branch network, an electronic device, and a readable storage medium.
Background
Pedestrian re-identification (Re-ID) is a fine-grained instance recognition problem that aims to find a queried person in a set of pedestrian images or videos captured by a distributed system of multiple non-overlapping cameras. Pedestrian re-identification has wide application in real life, such as criminal search, multi-camera tracking, and missing-person search. Re-identification relies heavily on the visually similar appearance of pedestrians across disjoint camera views; its basic task is to learn discriminative person features and to associate a query with the best-matching person in a gallery of images or videos. Illumination and geometric variations across cameras are the biggest challenges. Most work has focused on supervised learning on small to large data sets in various environments. However, these methods require a large amount of paired, cross-camera labeled data, which limits scalability to large-scale applications where only unlabeled data is available, since paired labeled data requires considerable manpower and material resources. Owing to the large diversity between different data sets, Unsupervised Domain Adaptation (UDA) models have been proposed, which train on a labeled source domain and then transfer the resulting pre-trained model from the identity-labeled source domain to an unlabeled target image domain. In real life, unlabeled target-domain data can easily be recorded, so training a pedestrian re-identification model with such images is more practical.
Unsupervised domain-adaptive pedestrian re-identification is widely used because it saves manual labeling costs. The first category of methods learns domain-invariant features by aligning target features; some of these methods use semantic attributes to align feature distributions in a latent space. However, these methods mostly rely on additional attribute annotations, which must be supplied manually. Another approach in this category converts labeled source-domain images into the style of the target domain through style transfer, using a generative adversarial network (GAN) to align feature distributions. SPGAN and PTGAN transform source-domain images to match the image style of the target domain while preserving person identity; the style-converted images and their identity labels are then used to fine-tune the model. Zhong et al. achieve cross-camera training by learning camera-transfer-invariant properties. This approach relies heavily on the quality of the image generation and does not explore the complex relationships between different samples in the target domain. The second category is pseudo-label-based adaptation, a more direct unsupervised cross-domain re-identification method that assigns pseudo-labels directly to unlabeled target images and fine-tunes the pre-trained model in a supervised manner.
Unsupervised pedestrian re-identification typically performs worse than supervised pedestrian re-identification because it lacks paired labeled data for learning camera-invariant feature representations, although the performance gap with supervised methods has narrowed significantly in recent years. Owing to its scalable applications, unsupervised re-identification has therefore attracted increasing attention. Most existing unsupervised methods group unlabeled images with a clustering algorithm and train the network with the pseudo-labels generated by the clustering, but training of the neural network is hindered by label noise. The noise mainly comes from the gap between the source and target domains, illumination effects, defects of the clustering algorithm, and so on, and it has a crucial impact on the final performance. To deal with noisy labels, one of the most popular approaches is to train paired networks so that they can correct each other: Co-teaching uses two student networks, and MMT uses two student and two teacher networks, but each network has the same structure, so the paired networks easily converge toward each other and fall into local minima. To alleviate these problems, some approaches choose different training samples, different initialization parameters, or different data augmentations. The clustering algorithm generates hard pseudo-labels with 100% confidence. Since pedestrian re-identification is a fine-grained recognition problem, it is not uncommon for different people in a data set to wear similar clothing, and the hard labels of such similar samples can be very noisy. In this case, soft pseudo-labels (confidence < 100%) are more reliable, and learning with both hard and soft pseudo-labels effectively alleviates label noise. Ge et al. propose adding a Mean Teacher to the model as an online soft pseudo-label generator, which effectively reduces the error amplification caused by noisy labels during training.
However, most existing methods ignore the noise in the hard pseudo-labels generated by the clustering algorithm, which seriously hinders the training of the neural network. To mitigate cluster-based label noise, researchers have borrowed ideas for exploiting unlabeled data from semi-supervised learning and learning with noisy labels. MMT employs two student networks and two teacher networks, with the two student networks initialized differently to enhance the diversity of the teacher-student pairs; each teacher network provides pseudo-labels to supervise the student network of the other pair. However, although different initializations and different data augmentation techniques are used in the paired networks, how to handle noise samples during training remains a problem faced by cluster-based unsupervised pedestrian re-identification methods.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present disclosure provides an unsupervised pedestrian re-identification training method and apparatus based on a multi-branch network, and also provides an unsupervised pedestrian re-identification method and apparatus based on a multi-branch network, an electronic device and a readable storage medium.
According to one aspect of the disclosure, a method for unsupervised pedestrian re-identification training based on a multi-branch network is provided, which includes:
source domain training, namely inputting images with labels into a pre-training model for training in the source domain training stage to obtain a trained pre-training model;
in the target domain training stage, inputting an image without a label into a target domain training model, and obtaining the target domain training model after training;
the labels are used for identifying the types of the images, the types of the images correspond to pedestrians contained in the images, and the target domain training model comprises a trained pre-training model obtained in the source domain training stage.
According to at least one embodiment of the present disclosure, the method for unsupervised pedestrian re-identification training based on the multi-branch network comprises the following steps:
inputting each image in a training set and the label corresponding to the image into a pre-training model, and outputting a first global feature and a first local feature of each image after the image is processed by the pre-training model, wherein the label is used for identifying the category of the image, and the category of the image corresponds to the pedestrian contained in the image;
inputting the first global feature and the first local feature into a first classifier, and respectively obtaining predicted values corresponding to the first global feature and the first local feature, wherein the predicted values identify, after classification by the classifier, the category of the image corresponding to the first global feature and the first local feature;
constructing a pre-training loss function, calculating a training effect through the pre-training loss function, and repeatedly training the pre-training model until a calculated value of the pre-training loss function reaches an expected index;
wherein the pre-training loss function is obtained by weighted summation of a cross-entropy loss function and a triplet loss function.
According to at least one embodiment of the present disclosure, in the unsupervised pedestrian re-identification training method based on the multi-branch network, the pre-training model comprises:
the main network model is connected with each branch network model in series, receives the image input to the pre-training model, processes the image and inputs the obtained image characteristics to each branch network model;
at least two branch network models, including a global branch network model and a local branch network model, wherein the global branch network model obtains global features by performing global maximum pooling on the image features, and the local branch network model obtains local features by performing average pooling on parts of the image features and then splicing them together;
the main network comprises a convolutional neural network and a convolutional module attention mechanism module, and the branch network comprises a convolutional neural network and a convolutional module attention mechanism module.
According to at least one embodiment of the present disclosure, in the unsupervised pedestrian re-identification training method based on the multi-branch network, the convolutional block attention module comprises:
the channel attention module is used for receiving image characteristics, wherein the image characteristics are obtained after an image input to the backbone network model is processed by the backbone model, the characteristics of the image are subjected to global maximum pooling operation and average pooling operation respectively to obtain two first characteristic graphs, the two first characteristic graphs are subjected to splicing operation, and a space attention weight value is generated through convolutional layer and normalization operation;
the spatial attention module is used for receiving the feature maps output by the channel attention module, respectively carrying out global maximum pooling operation and global average pooling operation on the feature maps output by the channel attention module to obtain two second feature maps, carrying out splicing operation on the two second feature maps based on an image channel, and generating a spatial attention weight through convolutional layer and normalization operation;
wherein the channel attention module and the space attention module are connected in a serial manner.
According to the unsupervised pedestrian re-identification training method based on the multi-branch network of at least one embodiment of the present disclosure, the target domain training model comprises:
the teacher network receives each image of the target domain training set and outputs a second global feature and a second local feature corresponding to each image;
the student network receives each image of the target domain training set and outputs a third global feature and a third local feature corresponding to each image;
the second classifier is used for receiving a second global feature and a second local feature output by a teacher network and outputting soft labels of images corresponding to the second global feature and the second local feature respectively, wherein the soft labels are used for identifying the classes of the images corresponding to the target domain training set;
the clustering model is used for receiving the characteristic output by the teacher network and spliced by the second global characteristic and the second local characteristic, and generating a hard label through clustering, wherein the hard label is used for identifying the category of the corresponding image in the target domain training set;
the third classifier is used for receiving a third global feature and a third local feature output by the student network and outputting a predicted value of an image corresponding to the third global feature and the third local feature, wherein the predicted value is used for identifying the category of the corresponding image in the target domain training set;
the structures of the teacher network and the student network are consistent with the structure of the training model in the source domain training stage, and are the trained pre-training models obtained in the source domain training stage. According to at least one embodiment of the present disclosure, in a training phase of a target domain training model, the teacher network and the student networks are supervised by an asymmetric branch supervision mode, and each student network is supervised by a different teacher network branch, including:
a soft label generated by the global branch of the teacher network supervises the local branch of the student network;
a soft label generated by the local branch of the teacher network supervises the global branch of the student network;
the teacher network generates hard labels that supervise the global branch and the layout branch of the student network, respectively.
According to the unsupervised pedestrian re-identification training method based on the multi-branch network, in the target domain training stage, a target domain training loss function is constructed, the training effect is calculated through this loss function, and the target domain training model is trained repeatedly until the calculated value of the loss function reaches the expected index; the target domain training loss function is obtained by weighted summation of the cross-entropy loss function of the student network under the clustered hard labels, the soft cross-entropy loss function under asymmetric branch supervision, and the soft triplet loss function under asymmetric branch supervision.
According to still another aspect of the present disclosure, there is provided a multi-branch network-based unsupervised pedestrian re-recognition training device, including:
a source domain training module comprising;
a trunk network model serially connected to each branch network model;
the local branch network model is used for splicing all the characteristics together after average pooling to obtain local characteristics;
the pre-training loss function calculation module is used for constructing a pre-training loss function and calculating a training effect through the pre-training loss function;
a target domain training module comprising:
the teacher network receives each image of the target domain training set and outputs a second global feature and a second local feature corresponding to each image;
the student network receives each image of the target domain training set and outputs a third global feature and a third local feature corresponding to each image;
the second classifier is used for receiving a second global feature and a second local feature output by a teacher network and outputting soft labels of images corresponding to the second global feature and the second local feature respectively, wherein the soft labels are used for identifying the category of the corresponding image in the target domain training set;
the clustering model is used for receiving the characteristic output by the teacher network and spliced by the second global characteristic and the second local characteristic, and generating a hard label through clustering, wherein the hard label is used for identifying the category of the corresponding image in the target domain training set;
the third classifier is used for receiving a third global feature and a third local feature output by the student network and outputting a predicted value of an image corresponding to the third global feature and the third local feature, wherein the predicted value is used for identifying the category of the corresponding image in the target domain training set;
and the target domain training loss function calculation module is used for constructing a pre-training loss function and calculating a training effect through a target training model loss function.
According to still another aspect of the present disclosure, there is provided a multi-branch network-based unsupervised pedestrian re-recognition apparatus including:
the student network receives an image containing a pedestrian, and the third classifier outputs an image category corresponding to the identity of the pedestrian contained in the image.
According to still another aspect of the present disclosure, there is provided a method for unsupervised pedestrian re-identification based on a multi-branch network, including:
inputting the image or the image sequence containing the pedestrian into the unsupervised pedestrian re-identification device based on the multi-branch network, and obtaining the category of the pedestrian through the identification of the unsupervised pedestrian re-identification device based on the multi-branch network.
According to yet another aspect of the present disclosure, there is provided an electronic device including:
a memory storing execution instructions;
a processor executing execution instructions stored by the memory to cause the processor to perform any of the methods described above.
According to yet another aspect of the present disclosure, there is provided a readable storage medium having stored therein execution instructions for implementing any of the above methods when executed by a processor.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
Fig. 1 is a general flow diagram of an unsupervised pedestrian re-identification training method based on a multi-branch network according to one embodiment of the present disclosure.
FIG. 2 is a schematic diagram of a training module structure on a tagged source domain according to one embodiment of the present disclosure.
FIG. 3 is a schematic illustration of an attention mechanism module configuration according to one embodiment of the present disclosure.
FIG. 4 is a diagram of a training module in a target domain teacher-student network according to one embodiment of the present disclosure.
FIG. 5 is a diagram of a pseudo tag co-training module, according to one embodiment of the present disclosure.
Fig. 6 is a schematic diagram of EMA weight update according to an embodiment of the present disclosure.
Fig. 7 is a flow chart of an unsupervised pedestrian re-identification method based on a multi-branch network according to an embodiment of the disclosure.
Fig. 8 is a schematic structural diagram of an unsupervised pedestrian re-identification device based on a multi-branch network according to an embodiment of the disclosure.
Description of the reference numerals
1000 unsupervised pedestrian re-identification device based on multi-branch network
1002 characteristic diagram extraction module
1004 salient feature extraction module
1006 global feature extraction module
1008 local feature extraction module
1010 pedestrian identity judgment module
1100 bus
1200 processor
1300 memory
1400 other circuits
Detailed Description
The present disclosure will be described in further detail with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the present disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. Technical solutions of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Unless otherwise indicated, the illustrated exemplary embodiments/examples are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Accordingly, unless otherwise indicated, features of the various embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concept of the present disclosure.
The use of cross-hatching and/or shading in the drawings is generally used to clarify the boundaries between adjacent components. As such, unless otherwise noted, the presence or absence of cross-hatching or shading does not convey or indicate any preference or requirement for a particular material, material property, size, proportion, commonality between the illustrated components and/or any other characteristic, attribute, property, etc., of a component. Further, in the drawings, the size and relative sizes of components may be exaggerated for clarity and/or descriptive purposes. While example embodiments may be practiced differently, the specific process sequence may be performed in a different order than that described. For example, two processes described consecutively may be performed substantially simultaneously or in reverse order to that described. In addition, like reference numerals denote like parts.
When an element is referred to as being "on," "connected to" or "coupled to" another element, it can be directly on, connected or coupled to the other element, or intervening elements may be present. However, when an element is referred to as being "directly on," "directly connected to" or "directly coupled to" another element, there are no intervening elements present. For purposes of this disclosure, the term "connected" may refer to a physical connection, an electrical connection, and the like, with or without intermediate components.
The terminology used herein is for the purpose of describing particular embodiments and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, when the terms "comprises" and/or "comprising" and variations thereof are used in this specification, they specify the presence of stated features, integers, steps, operations, elements, components and/or groups thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as terms of approximation and not as terms of degree, and as such are used to account for inherent deviations in measured, calculated, and/or provided values that would be recognized by one of ordinary skill in the art.
Fig. 1 is a flow chart of an unsupervised pedestrian re-identification training method based on a multi-branch network.
The embodiments of the present disclosure involve training several models, which requires that training sample data be prepared in advance; the details of the training sample data are as follows. In this and other embodiments, the data sets used include Market-1501, DukeMTMC-reID, and MSMT17. The Market-1501 data set contains 32668 images of 1501 pedestrians captured by 6 cameras; 12936 images of 751 pedestrians are selected for training, and the rest are used for testing. In the test set, 3368 pictures are used as query images, and the remaining 19732 pictures are used as gallery images. The DukeMTMC-reID data set was captured by 8 cameras; 16522 images of 702 pedestrians are used for training and another 702 pedestrians are used for testing, the test set containing 2228 query images and 17661 gallery images. MSMT17 contains 126441 images of 4101 pedestrians captured by 15 cameras on a university campus in Beijing; 32621 images of 1041 pedestrians are used for training, and the rest are used for testing. In the test set, 11659 images are used as query images, and the remaining 82161 images are used as gallery images. Taking the Market-1501 data set as an example, its directory structure includes bounding_box_test (test set), bounding_box_train (training set), and query (query set). The naming convention of Market-1501 pictures is illustrated by the example 0342_c3s1_078792_03.jpg: 0342 is the identity number of the person in the picture; c3 is the third camera (camera 3); s1 is the first video sequence (sequence 1); 078792 is the 078792th frame in the video sequence; and 03 is the 3rd detection box in that frame.
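As a small illustration of this naming convention, the following hypothetical Python parser decodes a Market-1501 file name; the function and field names are our own, not part of the data set's tooling:

```python
import re

PATTERN = re.compile(r"(\d+)_c(\d)s(\d)_(\d+)_(\d+)\.jpg")

def parse_market1501(name: str) -> dict:
    # Fields: person ID, camera index, sequence index, frame index, box index.
    pid, cam, seq, frame, bbox = PATTERN.fullmatch(name).groups()
    return {"person_id": int(pid), "camera": int(cam),
            "sequence": int(seq), "frame": int(frame), "bbox": int(bbox)}

print(parse_market1501("0342_c3s1_078792_03.jpg"))
# {'person_id': 342, 'camera': 3, 'sequence': 1, 'frame': 78792, 'bbox': 3}
```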
As shown in fig. 1, the unsupervised pedestrian re-identification training method based on the multi-branch network includes the following training process.
First, source domain training is performed. And in the source domain training stage, inputting the image with the label into a pre-training model for training to obtain a trained pre-training model. The tag is used to identify a category of the image, which corresponds to the identity of a pedestrian contained in the image. In the source domain training stage, firstly, a pre-training model is constructed, then, the features of the image input into the pre-training model are extracted through the pre-training model, and finally, in the repeated training process, the training effect is judged through the calculated value of the loss function until the preset training index is met. Wherein the loss function is constructed based on the cross-entropy losses and the triplet losses. FIG. 2 is one embodiment of a pre-trained model.
Second, target domain training is performed. In the target domain training stage, unlabeled images are input into a target domain training model, and the trained target domain training model is obtained after training. The target domain training model includes the trained pre-training model obtained in the source domain training stage. In the target domain training stage, an unlabeled image is input into a target domain training model consisting of a teacher network and a student network. First, features are extracted through the convolutional neural network layers of the teacher network and the student network respectively, obtaining a feature map of the image; then, the salient features of the image are extracted through the attention mechanism modules of the teacher network and the student network respectively; next, the image features based on the feature map and the salient features are input into the branch network parts of the teacher network and the student network respectively, and the global and local features of the image are extracted through the branch networks; then, soft labels and hard labels are generated through the classifier and the clustering model in the branch network part of the teacher network, while predicted values of the image category are generated through the classifier in the branch network part of the student network. During training, the loss function value is calculated through a pre-constructed loss function while the network parameters of the teacher network and the student network are continuously updated, so that the branches of the teacher network supervise the branches of the student network, and the loss function value is used to judge whether the target domain training model has reached the training index. FIG. 4 is a specific embodiment of the target domain training model.
Through the training of the two stages, source domain training and target domain training, a recognition model for pedestrian identification, namely the unsupervised pedestrian re-identification device based on the multi-branch network, can be obtained. By inputting an image containing a pedestrian into the unsupervised pedestrian re-identification device based on the multi-branch network, the identity of the pedestrian in the image can be obtained through the device's recognition. In the target domain training stage, the teacher network and the student network are obtained by migrating the trained pre-training model obtained in the source domain training stage.
As shown in fig. 2, the composition of the pre-trained model includes the following parts.
The main network model is connected in series with each branch network model; it receives the image input to the pre-training model, processes the image, and feeds the resulting image features to each branch network model. The main network comprises a convolutional neural network and a convolutional block attention module. In a specific implementation, the backbone network model combines an OSNet network with a CBAM attention mechanism.
The local branch network model splices the image features together after average pooling to obtain local features. Each branch network comprises a convolutional neural network and a convolutional block attention module. In a specific implementation, the branch network model combines an OSNet network with a CBAM attention mechanism; attention mechanisms are added after the second layer of the OSNet (in the trunk) and before the fourth layer (in the global and local branches).
In a specific implementation, the first three layers of the OSNet network are used as the main network, and the last two layers are used as the global branch network and the local branch network, wherein the global branch obtains the final feature through global maximum pooling, and the local branch divides the feature map into 6 parts that are separately average-pooled and then spliced together. CBAM attention mechanisms are placed in the backbone network, the global branch, and the local branch. The CBAM attention mechanism is shown in fig. 3; CBAM consists of two independent sub-modules, a channel attention module (top half of fig. 3) and a spatial attention module (bottom half of fig. 3), combined in a serial fashion.
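The sketch below illustrates the two branch heads described above in PyTorch: global max pooling for the global branch, and six average-pooled horizontal stripes concatenated for the local branch. It is a minimal reading of the description, and the class and variable names are assumptions, not the patent's reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchHeads(nn.Module):
    def __init__(self, num_stripes: int = 6):
        super().__init__()
        self.num_stripes = num_stripes

    def forward(self, feat: torch.Tensor):
        # feat: (N, C, H, W) feature map from the last backbone stage
        f_global = F.adaptive_max_pool2d(feat, 1).flatten(1)      # (N, C)
        stripes = feat.chunk(self.num_stripes, dim=2)             # split height-wise
        f_local = torch.cat(
            [F.adaptive_avg_pool2d(s, 1).flatten(1) for s in stripes], dim=1
        )                                                         # (N, C * 6)
        return f_global, f_local
```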
As shown in fig. 3, attention is directed to a schematic of the structure of the force mechanism module. The composition of the attention module includes the following parts.
The channel attention module receives image features obtained after the image input to the backbone network model has been processed by the backbone model. The features are subjected to a global maximum pooling operation and an average pooling operation respectively to obtain two first feature maps, the two first feature maps are combined, and a channel attention weight is generated through a shared layer and a normalization operation. In a specific implementation, the input feature x (H × W × C) is subjected to global maximum pooling and global average pooling respectively to obtain two 1 × 1 × C feature maps, which are sent to a shared connection layer, added together, and passed through a sigmoid activation to generate the final channel attention weight.
The spatial attention module receives the feature map output by the channel attention module, performs maximum pooling and average pooling on it respectively to obtain two second feature maps, splices the two second feature maps along the image channel dimension, and generates a spatial attention weight through a convolutional layer and a normalization operation. In a specific implementation, the feature map output by the channel attention module is max-pooled and average-pooled over the channel axis to obtain two H × W × 1 feature maps, which are then concatenated along the channel dimension; the spatial attention weight is generated through a convolutional layer and a sigmoid operation. The channel attention module and the spatial attention module are connected in series.
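The following is a minimal PyTorch sketch of such a serial CBAM block (channel attention followed by spatial attention); the reduction ratio and kernel size are conventional CBAM defaults, assumed here rather than taken from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared layer applied to both the max-pooled and avg-pooled descriptors.
        self.shared = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        w = self.shared(F.adaptive_max_pool2d(x, 1)) + self.shared(F.adaptive_avg_pool2d(x, 1))
        return torch.sigmoid(w)  # (N, C, 1, 1) channel attention weight

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        max_map, _ = x.max(dim=1, keepdim=True)   # H x W x 1 max over channels
        avg_map = x.mean(dim=1, keepdim=True)     # H x W x 1 mean over channels
        return torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)      # channel attention first (serial connection)
        return x * self.sa(x)   # then spatial attention
```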
The method of source domain training by the pre-trained model shown in fig. 2 includes the following steps.
Each image in the training set, together with its label, is input into the pre-training model shown in fig. 2, which outputs a first global feature $f^{g}_{s,i}$ and a first local feature $f^{p}_{s,i}$ for each image. The label is used to identify the category of the image, which corresponds to the identity of the pedestrian contained in the image. In a specific implementation, the feature output by each branch of the pre-training model is 2048-dimensional. The image data in the training set are as follows: taking the Market-1501 data set as the source domain data, each source-domain image $x^{s}_{i}$ is input together with its ground-truth label $y'_{i}$ (identifying the identity of the pedestrian in the image). During training, the relevant parameters of the pre-training model are set as follows: the maximum number of epochs is set to 80; for each epoch, the number of iterations $R_{pre}$ is 200; at each iteration, 64 images of 16 identities are resized to 256 × 128 and input into the network; the initial learning rate is set to 0.00035 and is multiplied by 0.1 at epochs 40 and 70.
The first global feature $f^{g}_{s,i}$ and the first local feature $f^{p}_{s,i}$ are input into a first classifier to obtain the corresponding predicted values, namely the maximum-feature prediction $\hat{y}^{g}_{s,i}$ and the average-feature prediction $\hat{y}^{p}_{s,i}$ of the identities in the source-domain training set. The predicted values identify the category assigned by the classifier to the image corresponding to the first global feature and the first local feature, and the category of the image corresponds to the identity of the pedestrian contained in the image. The number of classes of the first classifier is determined by the number of pedestrian identities in the data set, that is, there are as many classification results as there are distinct pedestrians.
A pre-training loss function is constructed, the training effect is calculated through the pre-training loss function, and the pre-training model is trained repeatedly until the calculated value of the pre-training loss function reaches the expected index. The pre-training loss function is obtained by weighted summation of the cross-entropy loss function and the triplet loss function. In a specific implementation, the pre-training loss function is:

$$\mathcal{L}_{src} = \lambda^{s}_{id}\,\mathcal{L}^{s}_{id} + \lambda^{s}_{tri}\,\mathcal{L}^{s}_{tri}$$

where $\lambda^{s}_{id}$ and $\lambda^{s}_{tri}$ respectively represent the weight parameters of the source domain loss function.
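A short PyTorch sketch of this weighted source-domain objective follows; the helper name, the margin value, and the default weights are illustrative assumptions, not values specified by the patent:

```python
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()
tri_loss = nn.TripletMarginLoss(margin=0.3)

def pretrain_loss(logits, labels, anchor, positive, negative,
                  lam_id: float = 1.0, lam_tri: float = 1.0):
    # Weighted sum of identity cross-entropy (on classifier predictions) and
    # triplet loss (on branch features).
    return lam_id * ce_loss(logits, labels) + lam_tri * tri_loss(anchor, positive, negative)
```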
As shown in fig. 4, the target domain training model includes the following components.
And the Teacher network (Teacher Backbone) receives each image of the target domain training set and outputs a second global feature and a second local feature corresponding to each image. In one embodiment, the image size may be 128 × 64, or 256 × 128.
And a Student network (Student Backbone) for receiving each image of the target domain training set and outputting a third global feature and a third local feature corresponding to each image.
And a second classifier (not shown in the figure) for receiving the second global feature and the second local feature output by the teacher network and outputting soft labels of the images corresponding to the second global feature and the second local feature respectively, wherein the soft labels are used for identifying the category of the corresponding image in the target domain training set.
The clustering model receives the feature output by the teacher network, formed by splicing the second global feature and the second local feature (the Concat operation shown in the figure), and generates hard labels by clustering; a hard label identifies the category of the corresponding image in the target domain training set. In a specific implementation, the clustering model can be realized by DBSCAN clustering. Using the radius r and the minimum number of points within that radius (min_samples) as evaluation criteria, the points in the data set are classified by density, where each point corresponds to the spliced second global and second local feature of one image in the data set. If the r-neighborhood of a point contains more than min_samples points, the point is called a core point; if a point is not a core point but lies in the r-neighborhood of some core point, it is called a boundary point; and if a point is neither a core point nor a boundary point, it is called a noise point. After all points in the data set have been divided into these three types, the core points within distance r of each other are grouped into one cluster, each boundary point is assigned to the cluster of an adjacent core point, the noise points are discarded, and the division of the whole data set is thereby completed. In a specific implementation, the relevant parameters of the clustering method are set as follows: the density radius r is 0.002, and the minimum cluster size min_samples is 4. The number of epochs is set to 40, and each epoch runs 400 iterations. At each iteration, 64 images of 16 cluster-based pseudo-identities are resized to 256 × 128 and input into the network. The network learning rate is fixed at 0.00035.
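A minimal sketch of this pseudo-label step with scikit-learn's DBSCAN, using the eps and min_samples values quoted above; the function name and the L2 normalization are assumptions (many re-ID pipelines cluster a re-ranked distance matrix instead of raw features):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def generate_hard_labels(f_global: np.ndarray, f_local: np.ndarray):
    # Concatenate the teacher's global and local features (the Concat step),
    # then L2-normalise each row before density-based clustering.
    feats = np.concatenate([f_global, f_local], axis=1)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    labels = DBSCAN(eps=0.002, min_samples=4).fit_predict(feats)
    keep = labels != -1  # DBSCAN marks noise points with -1; they are discarded
    return labels, keep
```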
And a third classifier (not shown in the figure) for receiving the third global feature and the third local feature output by the student network and outputting a predicted value of the image corresponding to the third global feature and the third local feature, wherein the predicted value is used for identifying the category of the image corresponding to the target domain training set, and the category of the image corresponds to the identity of the pedestrian contained in the image.
And the target domain training loss function calculation module is used for constructing a target domain pre-training loss function and judging the training effect according to the calculated value of the target domain training loss function. The construction process of the target domain training loss function is as follows.
Cross-entropy loss function of the student network under the clustered hard labels:

$$\mathcal{L}_{ce} = \frac{1}{N}\sum_{i=1}^{N} \ell_{ce}\!\left(C_s\big(f_{s,i}\big),\ \tilde{y}_i\right)$$

Asymmetric-branch-supervised soft cross-entropy loss function, in which the soft label from one teacher branch supervises the student branch of the other type:

$$\mathcal{L}_{sce} = -\frac{1}{N}\sum_{i=1}^{N}\left[ C_t^{p}\big(f_{t,i}^{p}\big)\cdot\log C_s^{g}\big(f_{s,i}^{g}\big) + C_t^{g}\big(f_{t,i}^{g}\big)\cdot\log C_s^{p}\big(f_{s,i}^{p}\big)\right]$$

Asymmetric-branch-supervised soft triplet loss function:

$$\mathcal{L}_{stri} = -\frac{1}{N}\sum_{i=1}^{N}\left[\mathcal{T}_{t,i}\,\log \mathcal{T}_{s,i} + \big(1-\mathcal{T}_{t,i}\big)\log\big(1-\mathcal{T}_{s,i}\big)\right]$$

with the softmax triplet distance

$$\mathcal{T}_{i} = \frac{\exp\!\big(\lVert f_i - f_{i+}\rVert\big)}{\exp\!\big(\lVert f_i - f_{i+}\rVert\big) + \exp\!\big(\lVert f_i - f_{i-}\rVert\big)}$$

where $\mathcal{T}_i$ represents the softmax triplet distance of sample $x_i$, and $f_{i+}$ and $f_{i-}$ respectively denote the positive and negative samples in each batch. The determination of positive and negative samples utilizes the pseudo-labels generated by clustering.

The target domain training loss function is expressed as:

$$\mathcal{L}_{tgt} = \lambda_{c}\,\mathcal{L}_{ce} + \lambda_{sce}\,\mathcal{L}_{sce} + \lambda_{stri}\,\mathcal{L}_{stri}$$

where $\lambda_{c}$, $\lambda_{sce}$ and $\lambda_{stri}$ respectively represent the weight coefficients of the clustering (hard-label) loss, the soft cross-entropy loss, and the soft triplet loss.
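The sketch below illustrates these three terms in PyTorch. It is a minimal reading of the formulas above; the function names, the detach placement, and the default lambda values are illustrative assumptions rather than the patent's reference implementation:

```python
import torch
import torch.nn.functional as F

def soft_ce(student_logits, teacher_logits):
    # Soft cross-entropy: the teacher's softmax distribution supervises the student.
    p_t = F.softmax(teacher_logits.detach(), dim=1)
    return -(p_t * F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()

def softmax_triplet(dist_ap, dist_an):
    # Softmax triplet "distance" T_i in (0, 1), from batch-hard positive and
    # negative pair distances dist_ap / dist_an.
    return torch.exp(dist_ap) / (torch.exp(dist_ap) + torch.exp(dist_an))

def soft_triplet(s_ap, s_an, t_ap, t_an):
    # Binary cross-entropy between student and teacher softmax triplet distances.
    t = softmax_triplet(t_ap, t_an).detach()
    s = softmax_triplet(s_ap, s_an)
    return F.binary_cross_entropy(s, t)

def target_loss(l_ce, l_sce, l_stri, lam_c=1.0, lam_sce=0.5, lam_stri=0.5):
    # Weighted sum of the three target-domain terms.
    return lam_c * l_ce + lam_sce * l_sce + lam_stri * l_stri
```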
In the above components, the structures of the teacher network and the student network are consistent with the structure of the pre-training model in the source domain training stage, having been trained in that stage; the algorithm used for clustering can be DBSCAN, and the model parameters of the teacher network are updated from those of the student network by an exponential moving average (EMA).
When the target domain training model is trained, the specific implementation is as follows. The target-domain images $x^{t}_{i}$ are input into the teacher-student network (the teacher network and the student network). The teacher network generates global features $f^{g}_{t,i}$ and local features $f^{p}_{t,i}$; the second classifier converts these two feature vectors into predicted values $\tilde{y}^{g}_{i}$ and $\tilde{y}^{p}_{i}$, i.e. soft labels. The global and local features are spliced (concat) and input into the clustering model (DBSCAN), and hard labels (also called pseudo-labels) are generated by the clustering. The student network inputs its features $f^{g}_{s,i}$ and $f^{p}_{s,i}$ into the third classifier for classification, obtaining the predicted values $\hat{y}^{g}_{i}$ and $\hat{y}^{p}_{i}$.
a new classifier (second classifier, third classifier) is created anew based on the number of clusters at the beginning of each epoch. And taking the normalized mean characteristic of each cluster from the local branch to initialize the classifier of the local branch, and taking the normalized maximum characteristic from the global branch to initialize the classifier of the global branch.
After the target domain training, the student network and the third classifier are combined to form the unsupervised pedestrian re-identification device. And receiving an image containing the pedestrian to be identified through a student network, and outputting an image category for identifying the identity of the pedestrian through a third classifier.
FIG. 5 is a schematic diagram of a method for updating a parameter model for a teacher network and a student network according to one embodiment of the present disclosure.
A teacher-student network model is created in the target domain; the network model used by the teacher network and the student network is the same as that used by the pre-training model. The teacher network model weights are the EMA weights of the student network model, as shown in FIG. 5. The EMA weights during the training phase are defined as

$$E^{(t)}[\theta] = \alpha\,E^{(t-1)}[\theta] + (1-\alpha)\,\theta^{(t)}$$

where $t$ denotes the training step and $\alpha$ is a smoothing coefficient that controls the self-ensembling speed of the weights.
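A minimal PyTorch sketch of this update, assuming teacher and student are two models with identical architecture; the default alpha value and the handling of buffers are assumptions:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha: float = 0.999):
    # theta_T <- alpha * theta_T + (1 - alpha) * theta_S, applied per parameter.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
    # Buffers such as BatchNorm running statistics are copied directly here;
    # other EMA schemes average them as well.
    for b_t, b_s in zip(teacher.buffers(), student.buffers()):
        b_t.copy_(b_s)
```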
Fig. 6 is a schematic diagram of an asymmetric branch supervision method for a student network and a teacher network according to one embodiment of the present disclosure.
As shown in fig. 6, the present invention employs an asymmetric branch supervision approach. Each student branch is supervised by a teacher branch with a different structure: a soft label generated by the global branch of the teacher network supervises the local branch of the student network, and a soft label generated by the local branch of the teacher network supervises the global branch of the student network. In this way the weights of the paired teacher and student branches maintain their diversity.
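As a compact illustration, the asymmetric pairing can be written with the soft_ce helper sketched earlier; the logit variable names are assumptions:

```python
# Teacher-global soft labels supervise the student-local branch, and
# teacher-local soft labels supervise the student-global branch.
l_sce = soft_ce(student_local_logits, teacher_global_logits) \
      + soft_ce(student_global_logits, teacher_local_logits)
```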
It can be seen that the unsupervised pedestrian re-identification training method based on the multi-branch network provided by the present disclosure first includes a source domain pre-training stage, in which labeled images are input into the constructed pre-training model, features are extracted, and the network is optimized by computing a cross-entropy loss function and a triplet loss function. The pre-training model parameters trained in the pre-training stage are migrated to the target domain and used to initialize the teacher and student network model parameters. In the target domain training stage, unlabeled images are input into the network structure to extract features, and salient features are then extracted by the attention mechanism. Global and local features are extracted through the constructed teacher-student network: the teacher network model generates soft and hard pseudo-labels from the global and local features, and the student network model generates predicted values. A loss function is constructed to train the network in a cross-branch supervision manner.
Two sets of experiments were performed for the present invention. One set follows the unsupervised domain-adaptive setting, with the cross-domain data sets Market → DukeMTMC, DukeMTMC → Market, Market → MSMT, and DukeMTMC → MSMT; the experimental results are shown in Table 1. The other set follows the fully unsupervised setting on the Market and DukeMTMC data sets; the experimental results are shown in Table 2. The experimental results of the unsupervised pedestrian re-identification method based on the multi-branch network provided by the present invention are compared with those of other methods as follows.
TABLE 1 comparison of unsupervised adaptive methods
TABLE 2 comparison of completely unsupervised methods
Compared with the prior art, the unsupervised pedestrian re-identification method, the unsupervised pedestrian re-identification training method and the unsupervised pedestrian re-identification training device based on the multi-branch network have the following technical effects.
Firstly, to address the problem that traditional deep learning methods pay insufficient attention to secondary salient details, and in order to extract multi-scale features, OSNet is selected as the backbone network: it adaptively adjusts the receptive-field size according to the input information, captures different spatial scales, and encapsulates arbitrary combinations of multiple scales, thereby learning discriminative pedestrian features. An attention mechanism is added and fused with the multi-scale features in order to obtain salient features.
Secondly, regarding the handling of noisy labels: a common approach is to train paired networks so that each network helps correct the other, but paired models with the same structure tend to converge toward each other and fall into local minima. The present disclosure instead designs an asymmetric network structure within a teacher-student model and enhances the diversity of appearance features by extracting both global and local features, thereby obtaining better cluster-based hard labels.
Third, in deep learning, data is usually labeled as hard labels, but in fact, the same data contains different types of information, and direct labeling as a hard label results in loss of a large amount of information, thereby affecting the effect obtained by the final model. To reduce the pseudo tag noise, the present disclosure adds soft pseudo tags through network prediction. The global branch of the student network is supervised by the local branch of the teacher network, and the local branch of the student network is supervised by the global branch of the teacher network.
Fig. 7 is a flowchart illustrating an unsupervised pedestrian re-identification method based on a multi-branch network according to an embodiment of the disclosure.
As shown in fig. 7, the unsupervised pedestrian re-identification method S100 based on the multi-branch network includes the following steps.
In step S102, a feature map of the image is extracted based on the image of the contained pedestrian to be identified.
In step S104, salient features of the image are extracted based on the image feature map.
In step S106, a global feature of the image is extracted based on the salient feature of the image.
In step S108, local features of the image are extracted based on the salient features of the image.
In step S110, the identity of a pedestrian included in the image is determined based on the global features of the image and the local features of the image.
Fig. 8 is a schematic structural diagram of an unsupervised pedestrian re-identification device based on a multi-branch network according to an embodiment of the disclosure.
As shown in fig. 8, the unsupervised pedestrian re-identification device based on the multi-branch network comprises the following modules.
The feature map extraction module 1002 is used for extracting a feature map of the image from an image containing the pedestrian to be identified.
The salient feature extraction module 1004 is used for extracting salient features of the image based on the image feature map.
The global feature extraction module 1006 extracts global features of the image based on the salient features of the image.
The local feature extraction module 1008 extracts local features of the image based on the salient features of the image.
The pedestrian identity determination module 1010 determines the identity of a pedestrian included in the image based on the global features of the image and the local features of the image.
It should be noted that, in fig. 7 and fig. 8, implementation details corresponding to each step or each module are consistent with relevant implementation details related to other embodiments of the present disclosure, and are not described herein again.
Fig. 8 illustrates an example diagram of an apparatus employing a hardware implementation of a processing system. The apparatus may include corresponding means for performing each or several of the steps of the flowcharts described above. Thus, each step or several steps in the above-described flow charts may be performed by a respective module, and the apparatus may comprise one or more of these modules. The modules may be one or more hardware modules specifically configured to perform the respective steps, or implemented by a processor configured to perform the respective steps, or stored within a computer-readable medium for implementation by a processor, or by some combination.
The hardware architecture may be implemented using a bus architecture. The bus architecture may include any number of interconnecting buses and bridges depending on the specific application of the hardware and the overall design constraints. The bus 1100 couples various circuits including the one or more processors 1200, the memory 1300, and/or the hardware modules together. The bus 1100 may also connect various other circuits 1400, such as peripherals, voltage regulators, power management circuits, external antennas, and the like.
The bus 1100 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only a single connection line is shown, but this does not mean that there is only one bus or only one type of bus.
Any process or method description in the flowcharts or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present disclosure includes other implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those skilled in the art. The processor performs the various methods and processes described above. For example, the method embodiments in the present disclosure may be implemented as a software program tangibly embodied in a machine-readable medium, such as a memory. In some embodiments, some or all of the software program may be loaded and/or installed via the memory and/or a communication interface. When the software program is loaded into the memory and executed by the processor, one or more steps of the methods described above may be performed. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above by any other suitable means (e.g., by means of firmware).
The logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in the memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a readable storage medium and, when executed, performs one of or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present disclosure may be integrated into one processing module, or each unit may exist physically alone, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The integrated module, if implemented as a software functional module and sold or used as a separate product, may also be stored in a readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
In the description herein, reference to the terms "one embodiment/implementation," "some embodiments/implementations," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment/implementation or example is included in at least one embodiment/implementation or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment/implementation or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/implementations or examples. In addition, those skilled in the art may combine the different embodiments/implementations or examples, and the features thereof, described in this specification, provided they do not conflict.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two or three, unless specifically limited otherwise.
It will be understood by those skilled in the art that the foregoing embodiments are merely for clarity of illustration of the disclosure and are not intended to limit the scope of the disclosure. Other variations or modifications may occur to those skilled in the art, based on the foregoing disclosure, and are still within the scope of the present disclosure.

Claims (10)

1. An unsupervised pedestrian re-identification training method based on a multi-branch network, characterized by comprising the following steps:
source domain training: in the source domain training stage, inputting labeled images into a pre-training model for training to obtain a trained pre-training model; and
target domain training: in the target domain training stage, inputting unlabeled images into a target domain training model to obtain a pedestrian re-identification model after training;
wherein the labels are used for identifying the categories of the images, the category of an image corresponds to the pedestrian contained in the image, and the target domain training model comprises the trained pre-training model obtained in the source domain training stage.
2. The unsupervised pedestrian re-identification training method based on the multi-branch network as claimed in claim 1, wherein the training process of the pre-training model comprises:
inputting each image in a training set, together with the label corresponding to the image, into the pre-training model, which processes the image and outputs a first global feature and a first local feature of each image, wherein the label is used for identifying the category of the image, and the category of the image corresponds to the pedestrian contained in the image;
inputting the first global feature and the first local feature into a first classifier to obtain predicted values corresponding to the first global feature and the first local feature respectively, wherein the predicted values identify the category assigned by the first classifier to the image corresponding to the first global feature and the first local feature; and
constructing a pre-training loss function, calculating the training effect through the pre-training loss function, and repeatedly training the pre-training model until the calculated value of the pre-training loss function reaches an expected target;
wherein the pre-training loss function is obtained by a weighted summation of a cross-entropy loss function and a triplet loss function.
3. The unsupervised pedestrian re-identification training method based on the multi-branch network as claimed in claim 2, wherein the pre-training model comprises:
a backbone network model, connected in series with each branch network model, which receives the image input to the pre-training model, processes the image, and feeds the resulting image features to each branch network model; and
at least two branch network models, comprising a global branch network model and a local branch network model, wherein the global branch network model obtains global features by performing global maximum pooling on the image features, and the local branch network model obtains local features by average pooling the image features and splicing the pooled features together;
wherein the backbone network comprises a convolutional neural network and a convolutional block attention module, and each branch network comprises a convolutional neural network and a convolutional block attention module.
4. The multi-branch-network-based unsupervised pedestrian re-identification training method of claim 3, wherein the convolutional block attention module comprises:
a channel attention module, configured to receive image features obtained after an image input to the backbone network model has been processed by the backbone model, perform a global maximum pooling operation and an average pooling operation on the features respectively to obtain two first feature maps, splice the two first feature maps, and generate a channel attention weight through a convolutional layer and a normalization operation; and
a spatial attention module, configured to receive the feature map output by the channel attention module, perform a global maximum pooling operation and a global average pooling operation on it respectively to obtain two second feature maps, splice the two second feature maps along the image channel dimension, and generate a spatial attention weight through a convolutional layer and a normalization operation;
wherein the channel attention module and the spatial attention module are connected in series.
5. The unsupervised pedestrian re-identification training method based on the multi-branch network as claimed in claim 1, wherein the target domain training model comprises:
a teacher network, which receives each image of a target domain training set and outputs a second global feature and a second local feature corresponding to each image;
a student network, which receives each image of the target domain training set and outputs a third global feature and a third local feature corresponding to each image;
a second classifier, configured to receive the second global feature and the second local feature output by the teacher network and to output soft labels of the images corresponding to the second global feature and the second local feature respectively, wherein the soft labels are used for identifying the categories of the corresponding images in the target domain training set;
a clustering model, configured to receive the feature output by the teacher network, formed by splicing the second global feature and the second local feature, and to generate hard labels through clustering, wherein the hard labels are used for identifying the categories of the corresponding images in the target domain training set;
a third classifier, configured to receive the third global feature and the third local feature output by the student network and to output predicted values of the images corresponding to the third global feature and the third local feature, wherein the predicted values are used for identifying the categories of the corresponding images in the target domain training set; and
a target domain training loss function calculation module, configured to construct a target domain training loss function and to calculate the training effect through the target domain training loss function;
wherein the structures of the teacher network and the student network are consistent with the structure of the pre-training model in the source domain training stage, the pre-training model having been trained in the source domain training stage.
6. An unsupervised pedestrian re-identification training device based on a multi-branch network, characterized by comprising:
a source domain training module, comprising:
a backbone network model connected in series with each branch network model;
a local branch network model, configured to obtain local features by splicing all the features together after average pooling; and
a pre-training loss function calculation module, configured to construct a pre-training loss function and to calculate the training effect through the pre-training loss function; and
a target domain training module, comprising:
a teacher network, which receives each image of the target domain training set and outputs a second global feature and a second local feature corresponding to each image;
a student network, which receives each image of the target domain training set and outputs a third global feature and a third local feature corresponding to each image;
a second classifier, configured to receive the second global feature and the second local feature output by the teacher network and to output soft labels of the images corresponding to the second global feature and the second local feature respectively, wherein the soft labels are used for identifying the categories of the corresponding images in the target domain training set;
a clustering model, configured to receive the feature output by the teacher network, formed by splicing the second global feature and the second local feature, and to generate hard labels through clustering, wherein the hard labels are used for identifying the categories of the corresponding images in the target domain training set;
a third classifier, configured to receive the third global feature and the third local feature output by the student network and to output predicted values of the images corresponding to the third global feature and the third local feature, wherein the predicted values are used for identifying the categories of the corresponding images in the target domain training set; and
a target domain training loss function calculation module, configured to construct a target domain training loss function and to calculate the training effect through the target domain training loss function.
7. An unsupervised pedestrian re-identification device based on a multi-branch network, comprising:
the student network as claimed in claim 6 and a third classifier, wherein the student network receives an image containing a pedestrian and the third classifier outputs an image category, the image category corresponding to the identity of the pedestrian contained in the image.
8. An unsupervised pedestrian re-identification method based on a multi-branch network is characterized by comprising the following steps:
inputting an image or an image sequence containing the pedestrian into the multi-branch network based unsupervised pedestrian re-identification device according to claim 7, and obtaining the category of the pedestrian through the identification of the multi-branch network based unsupervised pedestrian re-identification device.
9. An electronic device, comprising:
a memory storing execution instructions; and
a processor executing execution instructions stored by the memory to cause the processor to perform the method of any of claims 1 to 5 or 8.
10. A readable storage medium having stored therein execution instructions, which when executed by a processor, are configured to implement the method of any one of claims 1 to 5 or 8.
CN202210333714.9A 2022-03-30 2022-03-30 Unsupervised pedestrian re-identification method and unsupervised pedestrian re-identification device based on multi-branch network Pending CN114882521A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210333714.9A CN114882521A (en) 2022-03-30 2022-03-30 Unsupervised pedestrian re-identification method and unsupervised pedestrian re-identification device based on multi-branch network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210333714.9A CN114882521A (en) 2022-03-30 2022-03-30 Unsupervised pedestrian re-identification method and unsupervised pedestrian re-identification device based on multi-branch network

Publications (1)

Publication Number Publication Date
CN114882521A true CN114882521A (en) 2022-08-09

Family

ID=82670585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210333714.9A Pending CN114882521A (en) 2022-03-30 2022-03-30 Unsupervised pedestrian re-identification method and unsupervised pedestrian re-identification device based on multi-branch network

Country Status (1)

Country Link
CN (1) CN114882521A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200342271A1 (en) * 2019-04-29 2020-10-29 Beijing Baidu Netcom Science And Technology Co., Ltd. Pedestrian re-identification method, computer device and readable medium
US20210055737A1 (en) * 2019-08-20 2021-02-25 Volkswagen Ag Method of pedestrian activity recognition using limited data and meta-learning
CN110942025A (en) * 2019-11-26 2020-03-31 河海大学 Unsupervised cross-domain pedestrian re-identification method based on clustering
CN111476168A (en) * 2020-04-08 2020-07-31 山东师范大学 Cross-domain pedestrian re-identification method and system based on three stages
CN111860678A (en) * 2020-07-29 2020-10-30 中国矿业大学 Unsupervised cross-domain pedestrian re-identification method based on clustering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiong Zijie: "Research on pedestrian re-identification based on splicing of global and local features" (基于全局特征和局部特征拼接的行人重识别方法研究), China Master's Theses Full-text Database (《中国硕士论文全文数据库》), 15 August 2020 (2020-08-15) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100690A (en) * 2022-08-24 2022-09-23 天津大学 Image feature extraction method based on joint learning
CN116468746A (en) * 2023-03-27 2023-07-21 华东师范大学 Bidirectional copy-paste semi-supervised medical image segmentation method
CN116468746B (en) * 2023-03-27 2023-12-26 华东师范大学 Bidirectional copy-paste semi-supervised medical image segmentation method
CN116229080A (en) * 2023-05-08 2023-06-06 中国科学技术大学 Semi-supervised domain adaptive image semantic segmentation method, system, equipment and storage medium
CN116229080B (en) * 2023-05-08 2023-08-29 中国科学技术大学 Semi-supervised domain adaptive image semantic segmentation method, system, equipment and storage medium
CN116935447A (en) * 2023-09-19 2023-10-24 华中科技大学 Self-adaptive teacher-student structure-based unsupervised domain pedestrian re-recognition method and system
CN116935447B (en) * 2023-09-19 2023-12-26 华中科技大学 Self-adaptive teacher-student structure-based unsupervised domain pedestrian re-recognition method and system
CN117556866A (en) * 2024-01-09 2024-02-13 南开大学 Data domain adaptation network construction method of passive domain diagram
CN117556866B (en) * 2024-01-09 2024-03-29 南开大学 Data domain adaptation network construction method of passive domain diagram

Similar Documents

Publication Publication Date Title
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN114882521A (en) Unsupervised pedestrian re-identification method and unsupervised pedestrian re-identification device based on multi-branch network
CN110909820B (en) Image classification method and system based on self-supervision learning
Farabet et al. Scene parsing with multiscale feature learning, purity trees, and optimal covers
Song et al. Multi-scale multi-feature context modeling for scene recognition in the semantic manifold
Lin et al. RSCM: Region selection and concurrency model for multi-class weather recognition
CN111783831B (en) Complex image accurate classification method based on multi-source multi-label shared subspace learning
CN107133569B (en) Monitoring video multi-granularity labeling method based on generalized multi-label learning
Liu et al. Nonparametric scene parsing: Label transfer via dense scene alignment
CN110942025A (en) Unsupervised cross-domain pedestrian re-identification method based on clustering
Zou et al. Harf: Hierarchy-associated rich features for salient object detection
CN110956185A (en) Method for detecting image salient object
CN109886161B (en) Road traffic identification recognition method based on likelihood clustering and convolutional neural network
Sahbi Imageclef annotation with explicit context-aware kernel maps
CN109472209A (en) A kind of image-recognizing method, device and storage medium
WO2021243947A1 (en) Object re-identification method and apparatus, and terminal and storage medium
CN113642547A (en) Unsupervised domain adaptive character re-identification method and system based on density clustering
Sorkhi et al. A comprehensive system for image scene classification
Wang et al. Multi-level feature fusion model-based real-time person re-identification for forensics
Wei et al. Food image classification and image retrieval based on visual features and machine learning
Lee et al. Property-specific aesthetic assessment with unsupervised aesthetic property discovery
CN110647897B (en) Zero sample image classification and identification method based on multi-part attention mechanism
CN114817596A (en) Cross-modal image-text retrieval method integrating semantic similarity embedding and metric learning
CN114463552A (en) Transfer learning and pedestrian re-identification method and related equipment
CN112232147B (en) Method, device and system for self-adaptive acquisition of super-parameters of face model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination