CN111242064A - Pedestrian re-identification method and system based on camera style migration and single marking - Google Patents

Pedestrian re-identification method and system based on camera style migration and single marking

Info

Publication number
CN111242064A
Authority
CN
China
Prior art keywords
image
pedestrian
training
images
data
Prior art date
Legal status
Pending
Application number
CN202010053330.2A
Other languages
Chinese (zh)
Inventor
Li Qiang (李强)
Gao Ling (高玲)
Wu Shaojun (吴绍君)
Li Yang (李杨)
Current Assignee
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date
Filing date
Publication date
Application filed by Shandong Normal University
Priority to CN202010053330.2A
Publication of CN111242064A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian re-identification method and system based on camera style migration and single marking, comprising the following steps: acquiring a plurality of unlabeled images to be subjected to pedestrian re-identification; marking the pedestrian to be identified in one of all the unlabeled images; inputting the marked images into a pre-trained CycleGAN network and outputting a plurality of camera style migration images corresponding to each marked image, to realize data amplification; marking the pedestrian in the camera style migration images corresponding to the marked images; putting the unlabeled images into the pre-trained CycleGAN network to realize data amplification; and inputting the marked images, the unmarked images, the camera style migration images corresponding to the marked images and the camera style migration images corresponding to the unmarked images into a pre-trained CNN network, and outputting the identification result of the pedestrian to be identified in each unmarked image.

Description

Pedestrian re-identification method and system based on camera style migration and single marking
Technical Field
The disclosure relates to the technical field of pedestrian re-identification, in particular to a pedestrian re-identification method and system based on camera style migration and single labeling.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Pedestrian re-identification (re-ID) uses computer vision techniques to find a particular pedestrian in an image library or video sequence: given a pedestrian of interest, the goal is to retrieve that same pedestrian from images or video captured by other surveillance cameras.
In the course of implementing the present disclosure, the inventors found that the following technical problems exist in the prior art:
In pedestrian re-identification research, the commonly used approach is supervised learning. A supervised training set requires input-output pairs, i.e., features together with manually annotated target labels; the machine learns the relationship between features and labels from the training data, so that when a new feature is input its label can be predicted. In other words, a model is learned from a given training data set and then used to predict labels for new data. Obtaining a reasonably good recognition model requires a large number of pictures with corresponding manually annotated labels, and the manual annotation process is labor- and time-consuming. Unlike supervised learning, unsupervised learning has no labels on the input data and no predefined targets; it directly learns the intrinsic relationships among the input features.
Existing single-sample-based studies have mostly focused on the selection of pseudo-labels. M. Ye, A. J. Ma, L. Zheng, J. Li, and P. C. Yuen, "Dynamic label graph matching for unsupervised video re-identification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5152-5160, and H. Fan, L. Zheng, C. Yan, and Y. Yang, "Unsupervised person re-identification: Clustering and fine-tuning," ACM Trans. Multimedia Comput. Commun. Appl., vol. 14, no. 4, p. 83, Oct. 2018, use a static policy to determine the number of pseudo-labels before the next round of training. These algorithms keep the size of the pseudo-label training set fixed throughout the iteration process.
Wu Y., Lin Y., Dong X., et al., "Exploit the Unknown Gradually: One-Shot Video-Based Person Re-Identification by Stepwise Learning," in Proc. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2018, propose a progressive learning framework that makes better use of unlabeled data when training pedestrian re-identification with limited samples. The algorithm initially trains a CNN model on the single labeled sample per identity, then generates a pseudo-label for every unlabeled sample and selects the most reliable pseudo-labeled data for training according to prediction confidence. Unlike the former methods, the size of the pseudo-label training set is not fixed but is continuously enlarged according to the sampling strategy; dynamically increasing the number of pseudo-labels achieves better results during iteration. However, none of the above methods considers the cross-camera pedestrian retrieval problem.
Pedestrian re-identification is a cross-camera picture retrieval problem: under different cameras, the same person can look very different due to different shooting angles, color differences caused by building occlusion, different picture backgrounds, and so on. Traditional single-sample labeling only labels a person under one particular camera and lacks discriminative learning of pictures across cameras, which leads to low cross-camera recognition rates. The weakness of previous models has two causes: 1) the amount of data from single-sample labeling is too small; 2) the model overfits to a particular camera and cannot adapt to a cross-camera data set. Pictures of the same pedestrian differ greatly under different cameras, while different pedestrians can look similar under the same camera.
Disclosure of Invention
In order to overcome the defects of the prior art, the disclosure provides a pedestrian re-identification method and system based on camera style migration and single marking;
in a first aspect, the present disclosure provides a pedestrian re-identification method based on camera style migration and single labeling;
the pedestrian re-identification method based on camera style migration and single marking comprises the following steps:
acquiring a plurality of non-label images to be subjected to pedestrian re-identification;
marking the pedestrian to be identified in one of all the unlabelled images to be subjected to pedestrian re-identification;
inputting the marked images into a pre-trained cycleGAN network, and outputting a plurality of camera style migration images corresponding to each image in the marked images to realize data amplification; marking the pedestrian of the camera style migration image corresponding to the marked image;
the unlabelled images are sent to a pre-trained cycleGAN network, and a plurality of camera style migration images corresponding to each image in the unlabelled images are output to realize data amplification;
and inputting the marked image, the unmarked image, the camera style migration image corresponding to the marked image and the camera style migration image corresponding to the unmarked image into a pre-trained CNN network, and outputting the identification result of the pedestrian to be identified in each image in the unmarked image.
In a second aspect, the present disclosure also provides a pedestrian re-identification system based on camera style migration and single annotation;
pedestrian re-identification system based on camera style migration and single marking includes:
an acquisition module configured to: acquiring a plurality of non-label images to be subjected to pedestrian re-identification;
a marking module configured to: marking the pedestrian to be identified in one of all the unlabelled images to be subjected to pedestrian re-identification;
a data amplification module configured to: inputting the marked images into a pre-trained cycleGAN network, and outputting a plurality of camera style migration images corresponding to each image in the marked images to realize data amplification; marking the pedestrian of the camera style migration image corresponding to the marked image;
the unlabelled images are sent to a pre-trained cycleGAN network, and a plurality of camera style migration images corresponding to each image in the unlabelled images are output to realize data amplification;
an identification module configured to: and inputting the marked image, the unmarked image, the camera style migration image corresponding to the marked image and the camera style migration image corresponding to the unmarked image into a pre-trained CNN network, and outputting the identification result of the pedestrian to be identified in each image in the unmarked image.
In a third aspect, the present disclosure also provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and executed on the processor, wherein when the computer instructions are executed by the processor, the method of the first aspect is performed.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the beneficial effects of the present disclosure are:
the method and the device for expanding the data provide a solution idea for data expansion by using image style migration aiming at the condition that the model identification performance is poor due to insufficient data volume in the single-labeling problem, and provide a new scheme for expanding the sample for the single-labeling step-by-step iterative experiment.
During training, the method proposes initializing the CNN model jointly on the single-labeled samples of the original data set and the single-labeled samples of the camera style migration data set.
When searching for credible pictures and assigning pseudo-labels, picture features are computed on the original data set, and a portion of the original pictures and of the pictures generated by camera style migration are randomly selected, given pseudo-labels, and put into the iterative training of the model.
Most of the existing pedestrian re-identification methods rely on complete data labeling, namely, data of people in each training set under different cameras need to be labeled. However, for practical monitoring scenes, such as monitoring videos in a city, it is very costly to manually label the pedestrian label of each video segment from a plurality of cameras. Therefore, we try to label only the samples with a single label, and let the network learn to use those unlabelled samples by itself. That is, for each pedestrian, we only need to label one of the videos or one of the pictures, and the rest videos or pictures are searched by the algorithm itself.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of a method of the first embodiment;
FIG. 2(a) to FIG. 2(c) are schematic diagrams of image style migration and data augmentation of the first embodiment; cam1 denotes an original real picture shot by camera 1 in the data set, and c1s2 denotes a generated picture of a pedestrian shot by camera 1 rendered in the shooting style of camera 2;
FIGS. 3(a) to 3(c) are schematic views of the operation principle of the cycleGAN of the first embodiment;
FIGS. 4(a) -4 (b) are schematic diagrams illustrating initialization of the model by a single-label scheme according to the first embodiment;
FIGS. 5(a) to 5(d) are the prediction accuracy and recall of the selected pseudo label candidate set of the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The first embodiment provides a pedestrian re-identification method based on camera style migration and single annotation;
the pedestrian re-identification method based on camera style migration and single marking comprises the following steps:
s1: acquiring a plurality of non-label images to be subjected to pedestrian re-identification;
s2: marking the pedestrian to be identified in one of all the unlabelled images to be subjected to pedestrian re-identification;
s3: inputting the marked images into a pre-trained cycleGAN network, and outputting a plurality of camera style migration images corresponding to each image in the marked images to realize data amplification; marking the pedestrian of the camera style migration image corresponding to the marked image;
the unlabelled images are sent to a pre-trained cycleGAN network, and a plurality of camera style migration images corresponding to each image in the unlabelled images are output to realize data amplification;
s4: and inputting the marked image, the unmarked image, the camera style migration image corresponding to the marked image and the camera style migration image corresponding to the unmarked image into a pre-trained CNN network, and outputting the identification result of the pedestrian to be identified in each image in the unmarked image.
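For orientation, the four steps S1-S4 can be summarized in a minimal Python sketch. This is purely illustrative, not the patented implementation: the `transfer_to_all_camera_styles` and `extract_features` interfaces and the distance-based scoring are assumptions.

```python
import torch

def reid_pipeline(unlabeled_images, labeled_index, cyclegan, cnn):
    """unlabeled_images: list of image tensors; labeled_index: the one image
    whose pedestrian has been marked (step S2). All interfaces are assumed."""
    # S3: style transfer for data amplification; a generated image inherits
    # the mark of its source image when the source is the marked one
    augmented = []
    for img in unlabeled_images:
        augmented.extend(cyclegan.transfer_to_all_camera_styles(img))

    # S4: embed originals plus style-transferred images, then rank every
    # original image against the single marked identity
    feats = cnn.extract_features(torch.stack(unlabeled_images + augmented))
    query = feats[labeled_index]
    dists = torch.cdist(query.unsqueeze(0), feats[:len(unlabeled_images)])
    return (-dists).squeeze(0)  # higher score = more likely the same pedestrian
```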
As one or more embodiments, the pre-trained cycleGAN network, the step of training comprising:
s31: constructing a cycleGAN network;
s32: constructing a training set; the training set includes: a Market-1501 data set or a DukeMTMC-reID data set;
s33: and taking the image containing the pedestrian b acquired by one camera a in the training set as an input value of the cycleGAN network, taking the images containing the pedestrian b acquired by all the cameras except the camera a in the training set as output values of the cycleGAN network, and training the cycleGAN network to obtain the trained cycleGAN network.
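A minimal sketch of the pairwise setup described in S31-S33 follows, assuming a generic `CycleGAN` class with a `train_step(x, y)` method; both names are hypothetical placeholders, not a specific library.

```python
# Sketch: one CycleGAN per camera pair (i, j), trained on unpaired image
# sets from the two cameras; the CycleGAN class itself is assumed.
def train_camera_cyclegans(images_by_camera, num_epochs=50):
    """images_by_camera: dict mapping camera id -> list of image tensors."""
    models = {}
    cams = sorted(images_by_camera)
    for i in cams:
        for j in cams:
            if i >= j:
                continue  # one model per unordered pair; it maps both ways
            gan = CycleGAN()  # two generators G_ij, G_ji and two discriminators
            for _ in range(num_epochs):
                # unpaired samples from the two domains (zip truncates, fine
                # for a sketch)
                for x, y in zip(images_by_camera[i], images_by_camera[j]):
                    gan.train_step(x, y)
            models[(i, j)] = gan
    return models
```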
As one or more embodiments, the pre-trained CNN network, the step of training comprising:
s41: performing primary training on the CNN network by using an ImageNet data set; in the primary training process, the image is used as an input value of the CNN network, and the pedestrian label of the image is used as an output value of the CNN network;
s42: performing secondary training on the CNN network after primary training by using a real label data set; in the secondary training process, the image in the real label data set is used as an input value of the CNN network after primary training, and the pedestrian label of the image in the real label data set is used as an output value of the CNN network after primary training;
s43: performing three-level training on the CNN network after the second-level training by using a plurality of unlabeled images to be subjected to pedestrian re-identification; in the third-level training process, all the unlabeled images are used as input values of the CNN network after the second-level training, and the CNN network after the second-level training outputs a pseudo label set; the reliability of the pseudo labels is discriminated, the pseudo labels with the reliability higher than a set threshold value are reserved, and the pseudo labels with the reliability lower than the set threshold value are removed;
s44: inputting the data set with the real label and the data set with the high-reliability pseudo label into the CNN network after the three-stage training for final training; and obtaining the trained CNN network.
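The four stages S41-S44 can be outlined as below; `build_cnn`, `finetune`, and `predict_pseudo_labels` are hypothetical helpers standing in for the steps just described.

```python
# Illustrative outline of the staged training S41-S44, under the stated
# assumptions about the helper functions.
def train_reid_cnn(imagenet_data, labeled_data, unlabeled_data, threshold):
    cnn = build_cnn(pretrained_on=imagenet_data)         # S41: primary training
    cnn = finetune(cnn, labeled_data)                    # S42: real-label training
    pseudo = predict_pseudo_labels(cnn, unlabeled_data)  # S43: pseudo-label pass
    reliable = [p for p in pseudo if p.reliability > threshold]
    cnn = finetune(cnn, labeled_data + reliable)         # S44: final training
    return cnn
```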
Further, before the step of performing secondary training on the CNN network after the primary training by using the real label dataset in S42, the method further includes:
and inputting the data set with the real label into a trained cycleGAN network for data amplification.
Further, in S43, before performing the third-level training step on the CNN network after the second-level training, using a plurality of unlabeled images to be subjected to pedestrian re-identification, the method further includes:
and inputting a plurality of unlabeled images to be subjected to pedestrian re-identification into a trained cycleGAN network for data amplification.
Further, in S43, the method for discriminating the reliability of the pseudo tag includes:
calculating the distance between the pedestrian image feature corresponding to the pseudo tag and the pedestrian image feature corresponding to the real tag of the same pedestrian, and if the distance is smaller than a set threshold value, indicating that the current pseudo tag is a reliable pseudo tag; otherwise, the current pseudo label is represented as an unreliable pseudo label.
The distance used is the Euclidean distance between samples (unlabeled and labeled) in the feature space:

d(θ; x_i, x_l) = ||φ(θ; x_i) − φ(θ; x_l)||

The closer an unlabeled sample is to a labeled sample in the feature space, the higher its reliability.
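A small PyTorch sketch of this reliability rule, assuming the feature tensors have already been extracted; the threshold is a free parameter.

```python
import torch

def is_reliable(unlabeled_feat, labeled_feats, threshold):
    """unlabeled_feat: (D,); labeled_feats: (N, D) features with real labels.
    Shapes and the threshold are assumptions for illustration."""
    dists = torch.norm(labeled_feats - unlabeled_feat, dim=1)  # Euclidean
    return bool(dists.min() < threshold)  # closer to a labeled sample = more reliable
```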
Single-sample labeling is set up as follows. The labeled data set is

L = {(x_i, y_i)},

where x represents input image data and y represents the identity label of that image data. The unlabeled data set is

U = {x_j},

which has only input data, without corresponding labels. The data set sizes are |L| = n_l + k and |U| = n_l + k + 2u.
Regarding the setting of ordinary single-sample labeling: a pedestrian re-identification data set O contains, in total, N pedestrian pictures of M pedestrians captured by K cameras. For each pedestrian, one picture is selected under camera 1 and given an initial label; if that pedestrian was not captured by camera 1, a picture is randomly selected from the next camera for labeling. This guarantees that every pedestrian has one labeled picture for initialization.
Regarding the data set E obtained by data style migration: it contains K × N pictures in total for the M pedestrians, and one picture of each pedestrian is selected and labeled under each camera style, so K × M labeled pictures in total are used for initialization. In the evaluation stage, the trained CNN model is applied to the query data and the gallery data, and the output is a ranking list over all gallery data obtained from the Euclidean distances between the query and the gallery. When using unlabeled data, a pseudo-label ŷ_i is predicted for each unlabeled sample x_i ∈ U. L_e, S_t and M_t denote the style migration label data set, the pseudo-label data set, and the index data set, respectively. Our approach trains the CNN model jointly with style migration and ordinary single-sample annotation.
CycleGAN, unlike a general GAN, does not require paired training images and can generate an image in another scene style from an image in a particular scene style. CycleGAN consists of two discriminators (D_X and D_Y) and two generators (G_XY and G_YX), and can perform not only the conversion from data set X to data set Y but also the conversion from Y back to X.
FIG. 2(a) to FIG. 2(c) are schematic diagrams of image style migration and data augmentation of the first embodiment; cam1 denotes an original real picture shot by camera 1 in the data set, and c1s2 denotes a generated picture of a pedestrian shot by camera 1 rendered in the shooting style of camera 2.
In FIGS. 3(a) to 3(c), X and Y are pedestrian picture data sets of two different styles; in the present disclosure, X and Y represent picture sets of two different camera styles.
FIG. 3(a) shows the interconversion of the two style data sets X and Y; the respective discriminators then adversarially distinguish real pictures from generated ones.
In FIG. 3(b), a picture x from data set X is input; the generator G_XY generates a Y-style picture ŷ = G_XY(x), and the generator G_YX then converts ŷ back into an X-style picture x̂ = G_YX(ŷ). This cyclic process achieves the conversion of the data set's image style: the new image keeps the pedestrian subject of data set X while taking on the image style and background of data set Y. Similarly, FIG. 3(c) shows the same operation with the data sets interchanged, so data set expansion can be realized by the cycle-consistency adversarial network.
CycleGAN is the combination of a unidirectional GAN for X -> Y and one for Y -> X. The two GANs share the two generators, and each has its own discriminator, giving two generators and two discriminators in total; CycleGAN therefore has four losses.
The discriminator (adversarial) loss for X -> Y is:

L_GAN(G_XY, D_Y, X, Y) = E_{y∼p(y)}[log D_Y(y)] + E_{x∼p(x)}[log(1 − D_Y(G_XY(x)))]   (1)

The discriminator loss for Y -> X is:

L_GAN(G_YX, D_X, Y, X) = E_{x∼p(x)}[log D_X(x)] + E_{y∼p(y)}[log(1 − D_X(G_YX(y)))]   (2)

The generator G_XY (cycle-consistency) loss is:

L_CYC(G_XY) = E_{x∼p(x)}[||G_YX(G_XY(x)) − x||_1]   (3)

The generator G_YX (cycle-consistency) loss is:

L_CYC(G_YX) = E_{y∼p(y)}[||G_XY(G_YX(y)) − y||_1]   (4)

The final loss of CycleGAN is therefore the sum of the four losses:

L(G_XY, G_YX, D_X, D_Y) = L_GAN(G_XY, D_Y, X, Y) + L_GAN(G_YX, D_X, Y, X) + L_CYC(G_XY) + L_CYC(G_YX)   (5)
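For concreteness, losses (1)-(5) might be computed as in the following PyTorch sketch, assuming discriminators that output probabilities in (0, 1); the cycle-loss weight `lam` is a conventional addition not stated in equation (5).

```python
import torch
import torch.nn.functional as F

def cyclegan_loss(x, y, g_xy, g_yx, d_x, d_y, lam=10.0):
    """A minimal rendering of (1)-(5); g_*/d_* are assumed callables, and
    lam is the usual cycle-loss weight (an assumption, not from eq. (5))."""
    fake_y, fake_x = g_xy(x), g_yx(y)
    # adversarial losses (1) and (2), written in the usual BCE form
    loss_gan_xy = F.binary_cross_entropy(d_y(y), torch.ones_like(d_y(y))) + \
                  F.binary_cross_entropy(d_y(fake_y), torch.zeros_like(d_y(fake_y)))
    loss_gan_yx = F.binary_cross_entropy(d_x(x), torch.ones_like(d_x(x))) + \
                  F.binary_cross_entropy(d_x(fake_x), torch.zeros_like(d_x(fake_x)))
    # cycle-consistency losses (3) and (4)
    loss_cyc = F.l1_loss(g_yx(fake_y), x) + F.l1_loss(g_xy(fake_x), y)
    return loss_gan_xy + loss_gan_yx + lam * loss_cyc  # total loss (5)
```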
First, each pair of cameras with different styles is regarded as two different domain spaces, and the image sets M_ij are interconverted by CycleGAN:

M_ij = G_ij(M_i), 1 ≤ i, j ≤ K   (6)

where i denotes the current camera, j ranges over the remaining cameras, and G_ij denotes the learned generator from the style of camera i to that of camera j. After all image sets of the different camera styles have been converted into one another, the data set becomes K times the original data set.
Since the real picture and the generated picture share an identity label, the experiments use ID-discriminative embedding (IDE) as the re-ID CNN model. Using a softmax loss, IDE treats re-ID training as an image classification task. In the implementation, because the generated pictures have low resolution, all input images are uniformly resized to 256 × 128. ResNet-50 is used as the backbone and fine-tuned from an ImageNet pre-trained model, with the last 1000-dimensional classification layer removed and two fully connected layers added. New training samples are generated with CycleGAN to expand the data set within a single data set, treating two different camera styles as two different domain spaces.
CycleGAN is used to learn the style of each camera, and the identity (mapping) loss for the mutual image-style conversion between every two cameras is:

L_identity(G_XY, G_YX) = E_{y∼p(y)}[||G_XY(y) − y||_1] + E_{x∼p(x)}[||G_YX(x) − x||_1]   (7)
Finally, for the K × N pictures captured by the K cameras for the M pedestrians, we obtain K × N training pictures in total, of which K × M are labeled and K × N − K × M are unlabeled.
After the cross-camera style picture migration is completed, all pictures are divided into an original picture data set O and a style migration picture data set E.
Then, the labeled data sets are set as

L = {(x_i, y_i)} and L_e = {(x_i^e, y_i^e)},

where x represents input image data and y represents the identity label of that image data. Similarly, according to the original real pictures and the generated pictures, the unpaired unlabeled data sets are set as

U = {x_j} and U_e = {x_j^e};

an unpaired unlabeled data set has only input data and no label information. According to the original image data set and the data set after cross-camera style migration, the pseudo-label data sets are set as

S = {(x_j, ŷ_j)} and S_e = {(x_j^e, ŷ_j^e)},

where the predicted identity label ŷ is produced by the CNN model trained on the initial labeled data.
The goal of single labeling is to make full use of a large amount of unlabeled data; one approach is to select the unlabeled data with useful value according to some criterion and assign it pseudo-labels. After the data set expansion is completed, the CNN model is updated in two steps. The first step trains the CNN model on four parts: the original label data set L, the data-style-migration label data set L_e, the pseudo-label set, and the unlabeled set. The second step selects some reliable pseudo-labels as candidates from the large amount of unlabeled data according to a prediction reliability criterion. In the first iteration, the CNN model is trained using only the label data sets, during which no unlabeled data is assigned a pseudo-label. As the iterations proceed, the pseudo-label candidate set is continuously enlarged. The style-migration + single-labeling method then learns step by step from the four parts of data, finally obtaining a robust model. A pseudo-label is assigned to an unlabeled candidate from the identity label of its nearest labeled neighbor in the feature space, and the reliability of the pseudo-label is measured by the distance between them.
The specific method is as follows: first, the two labeled data sets L and L_e are used to train an initial model; this initial model then predicts pseudo-labels for the unlabeled data, and all pseudo-labeled data is put into a candidate set, which is continuously updated. The most reliable samples are selected from the candidate set and given pseudo-labels, and the labeled data and pseudo-labeled data are trained together to produce a more robust CNN model. The pseudo-label selection process is dynamic: the number of pseudo-labels gradually increases, an iterative updating loop is entered, and the CNN model is updated step by step to make it more robust.
As shown in fig. 1, the original data set is first expanded by K (K is the number of cameras in the data set) times, and then the CNN network is initialized. And performing sample prediction on the unlabeled data through an initial CNN network, selecting a sample with high reliability to enter a candidate set, and then selecting a part of the sample as a pseudo label. After the selection of the pseudo label is finished, the CNN model is retrained together with the original label data, the whole process is dynamically changed, and the residual label-free sample data set is automatically updated after one iteration is finished. The number of pseudo-tags is increased gradually as the iteration progresses.
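A high-level sketch of the loop in FIG. 1 follows; every helper name here is a hypothetical placeholder for the corresponding step described above.

```python
# Hypothetical outline of the iterative scheme in FIG. 1; init_cnn,
# extract_features, assign_pseudo_labels_nn, most_reliable, remaining and
# retrain are placeholders for the steps described in the text.
def progressive_training(labeled, style_labeled, unlabeled, p, iterations):
    model = init_cnn(labeled + style_labeled)   # K-fold expanded initialization
    m_t = 0
    for t in range(iterations):
        feats = extract_features(model, unlabeled)
        pseudo, cost = assign_pseudo_labels_nn(feats, labeled + style_labeled)
        m_t += int(p * len(unlabeled))          # candidate set grows each round
        selected = most_reliable(pseudo, cost, m_t)   # smallest-cost samples
        model = retrain(model,
                        labeled + style_labeled + selected,   # CE losses
                        remaining(unlabeled, selected))       # exclusive loss
    return model
```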
Network training:
The updating step of the model is introduced first. In the t-th iteration, four kinds of data are used for training: the label set L of the original data set, the label set L_e of the style migration data set, the pseudo-label set S_t, and the unlabeled index set M_t. The two label data sets are labeled by the single-sample method, so L and L_e carry reliable label information. The pseudo-label set contains the most reliable pseudo-labels, so S_t has relatively reliable label information. The label and pseudo-label sets are optimized using the ID classifier and cross-entropy loss. The index set M_t has neither reliable pseudo-labels nor reliable usable information; M_t holds the remaining unlabeled data during the iterative process and, as the iterations progress, M_t also changes dynamically. Finally, the exclusive loss is used to optimize the CNN model.
a. Training of tagged pictures + tagged style transition pictures
The most critical step in the whole transfer learning + single-sample gradual learning process is to use the label data with the highest efficiency. The labeled data sets L and L_e carry real identity labels, and cross-entropy loss is uniformly used to optimize both parts of label data. Before optimizing the single-labeled data with cross-entropy loss, the data set undergoes cross-camera data migration, which expands the original data set to K times its size (K being the number of cameras of the original data set), so the original single-labeled data set also becomes K times larger. For ease of understanding they are written as L and L_e, but in the actual training process the two parts of data are trained together. The objective function for this part is:
min_{θ,ω} ∑_{(x_i, y_i) ∈ L ∪ L_e} ℓ_CE(f(ω; φ(θ; x_i)), y_i)   (8)
where x_i and y_i respectively denote the i-th input image and its corresponding identity label in the real picture data sets. f is the identity classifier parameterized by ω, applied to the feature embedding function φ, which is parameterized by θ. ℓ_CE measures the discrepancy between the predicted label f and the true identity label y_i; a smaller value indicates a prediction more similar to the true identity.
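A minimal PyTorch rendering of objective (8), with `backbone` playing the role of φ(θ; ·) and `classifier` the role of f(ω; ·); both are assumed modules.

```python
import torch.nn.functional as F

def labeled_ce_loss(backbone, classifier, images, identity_labels):
    """Sketch of eq. (8): cross-entropy over the single-labeled originals L
    and their style-transferred copies L_e, trained together."""
    feats = backbone(images)      # phi(theta; x_i)
    logits = classifier(feats)    # f(omega; phi(theta; x_i))
    return F.cross_entropy(logits, identity_labels)  # l_CE against y_i
```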
This stage is particularly important in the whole process of gradually iterating the model and fully using the unlabeled samples: since the initial iteration uses no pseudo-labeled data, a good initial model plays a key role in the subsequent training that adds pseudo-labeled data, allowing the pseudo-labeled data to be used more fully. Our method has K times the data volume of the original data set, providing a more robust initial model for the whole process.
b. Training of pseudo-tagged pictures + pseudo-tagged style migration pictures
In this part of the data training, our method is divided into two parts: the first part is the image data of the original data set, and the second part is the image data of the camera style migration data set. Features are computed from the pictures of the original data set, and for each picture one of the K camera styles is randomly selected for training. Whether the model can be optimized well on this part of data depends on the credibility of the pseudo-label set: selecting pseudo-labels with high credibility for joint training increases the robustness of the model, whereas selecting unreliable samples as pseudo-labels seriously damages it. The pseudo-label set S_t, formed by selecting candidates and assigning pseudo-labels under an efficient sampling criterion, also carries relatively reliable label supervision information, so cross-entropy loss is again used for optimization, following the same rule as for the labeled sets:
min_{θ,ω} ∑_{x_i ∈ U} s_i · ℓ_CE(f(ω; φ(θ; x_i)), ŷ_i)   (9)
where s_i ∈ {0,1} is the selection indicator of x_i, with x_i here denoting an unlabeled sample. s_i is generated by the preceding label-population process and decides whether the pseudo-labeled sample (x_i, ŷ_i) is used for identity classification training.
c. Training of unlabeled picture + unlabeled style transition picture
For exploiting the large amount of unlabeled data, the most common approach is to use the exclusive loss as an auxiliary self-supervision loss, extracting effective information to learn discriminative representations. This process mainly learns the differences between input images in order to distinguish samples, using the differences between pedestrian images to extract weak supervision information. An objective function is used to push each sample x_i in the unlabeled set away from the other samples x_j (i ≠ j) in the feature space:
max_θ ∑_{x_i, x_j ∈ M_t, i ≠ j} ||φ(θ; x_i) − φ(θ; x_j)||   (10)
A feature memory M stores all target image features and is updated after each iteration. M_i denotes the i-th column of M, i.e., the L2-normalized feature embedding of sample x_i. Because ||M_i − M_j||² = 2 − 2·M_iᵀM_j, maximizing the Euclidean distance between x_i and x_j is equivalent to minimizing the cosine similarity M_iᵀM_j, and the objective above is optimized by a softmax-like loss:
ℓ_E(θ; x_i) = −log( exp(φ(θ; x_i)ᵀ M_i / τ) / ∑_j exp(φ(θ; x_i)ᵀ M_j / τ) )   (11)
where φ(θ; ·) is our CNN model, whose role is to extract a D-dimensional feature for each picture. θ denotes the weights of the re-ID model, and the hyperparameter τ is the temperature factor of the softmax function; a higher temperature τ leads to a softer probability distribution. After each iteration, M is updated as follows:
M_i ← μ·M_i + (1 − μ)·φ(θ; x_i)   (12)
where the hyperparameter μ is the update rate of M. μ is not fixed to a constant: at the beginning of training a smaller μ is needed to accelerate the update of M, while as the epochs increase M must become gradually stable, which requires a larger μ; therefore μ is gradually increased over the whole process.
The exclusive loss is a self-supervised auxiliary loss, mainly used for learning discriminative representations from unlabeled data without identity labels. In the iterative optimization of the model, the exclusive loss mainly drives the model to learn the differences between input images so as to distinguish them, so more attention must be paid to the details of the input identities throughout the process. More samples are accessed as the iterations progress, and the differences between pedestrian images are exploited to provide useful supervisory information.
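A sketch of the exclusive loss (11) and the memory update (12) in PyTorch; the tensor shapes and the placement of the L2 normalization are assumptions consistent with the description above.

```python
import torch
import torch.nn.functional as F

def exclusive_loss(feats, memory, indices, tau=0.1):
    """feats: (B, D) batch embeddings; memory: (N, D) L2-normalized memory;
    indices: (B,) each sample's own slot in the memory. All shapes assumed."""
    feats = F.normalize(feats, dim=1)
    sims = feats @ memory.t() / tau        # cosine similarities over temperature
    # -log softmax at each sample's own slot: pushes x_i away from all x_j
    return F.cross_entropy(sims, indices)

@torch.no_grad()
def update_memory(memory, feats, indices, mu):
    """Eq. (12): running update of the memory, re-normalized afterwards."""
    feats = F.normalize(feats, dim=1)
    memory[indices] = mu * memory[indices] + (1 - mu) * feats
    memory[indices] = F.normalize(memory[indices], dim=1)
    return memory
```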
And (3) pseudo label assignment:
The selection and assignment of pseudo-labels play a crucial role in fully utilizing the unlabeled data. In label-estimation screening, the most common method is to rank unlabeled data by classification loss, but in practice the predictions of the classification loss do not adapt well to detection evaluation, and the classifier easily overfits single-labeled sample data. Instead, the distance in the feature space is used as the reference standard for pseudo-label credibility: a nearest-neighbor (NN) classifier assigns each unlabeled sample the pseudo-label of its nearest labeled neighbor. Euclidean distances are computed from the input features of the original data set, and all unlabeled data x_i ∈ U ∪ U_e are evaluated by the following formula:
x* = argmin_{(x_l, y_l) ∈ L ∪ L_e} ||φ(θ; x_i) − φ(θ; x_l)||   (13)
the dissimilarity cost is evaluated according to the following formula:
d(θ; x_i) = ||φ(θ; x_i) − φ(θ; x*)||   (14)
In the t-th iteration, the m_t unlabeled samples closest to the labeled samples are selected by:

S_t = argmin_{S ⊂ U ∪ U_e, |S| = m_t} ∑_{x_i ∈ S} d(θ; x_i)   (15)
where K denotes the number of cameras of the original picture data set, and m_t denotes the size of the pseudo-label set selected in the t-th iteration. The best prediction y* is then taken as the pseudo identity label, ŷ_i = y*, and the pseudo-labeled sample (x_i, ŷ_i) is put into the iterative training to optimize the model.
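The nearest-neighbor assignment of equations (13)-(14) might look as follows in PyTorch; feature tensors are assumed to be precomputed.

```python
import torch

def assign_pseudo_labels(unlabeled_feats, labeled_feats, labeled_ids):
    """unlabeled_feats: (U, D); labeled_feats: (N, D); labeled_ids: (N,).
    Returns the pseudo label y* and the dissimilarity cost d(theta; x_i)."""
    dists = torch.cdist(unlabeled_feats, labeled_feats)  # pairwise Euclidean
    cost, nearest = dists.min(dim=1)                     # eq. (14) and x* of (13)
    return labeled_ids[nearest], cost
```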
Regarding the iterative scheme: at each iteration, equations (9) and (11) are first optimized; labels are then estimated for the unlabeled data via equations (13)-(14), and reliable samples are selected for training by equation (15). The candidate-set size grows as m_t ← m_{t−1} + p × (n_l + k + 2u), where p ∈ (0,1) is an enlargement factor that controls the sampling size of the candidate set during the iteration process.
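A sketch of the selection rule (15) and the growth schedule of m_t; `cost` is the per-sample dissimilarity d(θ; x_i) produced by the previous sketch.

```python
import torch

def select_candidates(cost, m_t):
    """Eq. (15): keep the m_t unlabeled samples with the smallest cost."""
    order = torch.argsort(cost)   # most reliable (smallest cost) first
    return order[:m_t]            # indices forming the pseudo-label set S_t

def next_mt(m_t, p, total):
    """Growth schedule m_t <- m_{t-1} + p * total, with p in (0, 1)."""
    return m_t + int(p * total)
```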
The experimental scheme is as follows:
data set:
Market-1501: this data set was collected on the Tsinghua University campus, with images from 6 different cameras, one of which is low-resolution. The data set provides a training set and a test set. The training set contains 12,936 images and the test set contains 19,732 images. The images were automatically detected and cropped by a detector and contain some detection errors (close to actual use). There are 751 identities in the training data and 750 in the test set, so on average there are 17.2 training images per person.
DukeMTMC-reID: the images come from 8 different cameras. The data set provides a training set and a test set.
The training set contains 16,522 images and the test set contains 17,661 images. There are 702 identities in the training data, with an average of 23.5 training images per person.
Evaluation metrics: we use the cumulative matching characteristic (CMC) curve and mean average precision (mAP) to evaluate the performance of each method. The average precision (AP) is computed from the precision-recall curve, and mAP is the mean of the average precision over all queries. We list rank-1, rank-5, rank-10 and rank-20 scores to represent the CMC curve. The CMC score reflects retrieval accuracy, while mAP reflects recall.
Experimental setup
For this single-annotation experiment based on data augmentation, besides randomly selecting one image from camera 1 for each identity as initialization, we add the images interconverted between the different camera styles in all data sets. If a camera has no record of an identity, we randomly choose a sample from the next camera, ensuring that every identity has a sample for initialization.
Details of the experiment
We first remove the last classification layer of ResNet-50 to obtain a feature embedding model, initialized from an ImageNet pre-trained model. An additional fully connected layer with batch normalization and a classification layer are added on top of the CNN feature extractor to optimize the model with the label and pseudo-label losses. For the exclusive loss, the unlabeled features are passed through the batch-normalized fully connected layer and then L2-normalized.
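The described architecture might be sketched as below; the embedding width and any detail not stated in the text are assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet50

class IDEModel(nn.Module):
    """Sketch: ResNet-50 without its 1000-way classifier, plus a
    batch-normalized fully connected layer and an identity head.
    embed_dim is an assumed value, not stated in the text."""
    def __init__(self, num_identities, feat_dim=2048, embed_dim=1024):
        super().__init__()
        backbone = resnet50(pretrained=True)  # ImageNet initialization
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.embed = nn.Sequential(           # added fully connected layer
            nn.Linear(feat_dim, embed_dim), nn.BatchNorm1d(embed_dim), nn.ReLU())
        self.classifier = nn.Linear(embed_dim, num_identities)  # added head

    def forward(self, x):                     # x: (B, 3, 256, 128)
        f = self.backbone(x).flatten(1)       # (B, 2048) pooled features
        f = self.embed(f)
        return f, self.classifier(f)          # embedding and identity logits
```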
TABLE 1. Comparison with the state of the art on the Market-1501 data set
Compared with the state of the art on the Market-1501 data set, Baseline (supervised) shows the best, fully annotated performance. Baseline (ONE-EXAMPLE) shows the initial model trained with the single-annotation approach. TJOF denotes The Joint Objective Function, the latest method of Wu et al. in single-annotation research on image data sets. Ours is our method.
Baseline (supervised) is the experimental result under training with one hundred percent labeled data, with no unlabeled data in the process; in a real scene, however, complete manual annotation is impractical. In our method, thanks to camera style transfer learning, the migrated data also carry the labels of the original single-labeled data, so the amount of single-labeled data during initialization is K times the original. The initialization model trained this way is more robust. On the Market-1501 data set, the experimental results show that the rank-1 hit rate improves by 8.7% to 64.5%, the rank-5, rank-10 and rank-20 hit rates also improve considerably, and the mean average precision mAP improves by 5.2% over the most advanced method.
TABLE 2. Comparison with the state of the art on the DukeMTMC-reID data set
Compared with the state of the art on the DukeMTMC-reID data set, Baseline (supervised) shows the best, fully annotated performance. Baseline (ONE-EXAMPLE) shows the initial model trained with the single-annotation approach. TJOF denotes The Joint Objective Function [Wu Y., Lin Y., Dong X., et al., "Progressive Learning for Person Re-Identification with One Example," IEEE Transactions on Image Processing, 2019], the latest method of Wu et al. in single-annotation research on image data sets. Ours is our method.
The performance of our method on the DukeMTMC-reID data set remains excellent: on the commonly used evaluation metrics, it improves to varying degrees over the currently most advanced method. The specific percentage improvements can be found in Table 2. The superior performance on both data sets verifies the effectiveness of our method.
Comparison of training with original data and with migrated data
Even when our method is trained without migration data, the iterative gains from initializing the model on the original data set with the K-fold expanded single-label setting are much higher than those of the latest method, with the rank-1 hit rate and mean average precision mAP higher by 3.1% and 4%, respectively. To better show the benefit of the migration data set in training, comparison experiments were performed on the two image data sets Market-1501 and DukeMTMC-reID: training was run on the original data set and on the migration data set separately, still using the data-expanded single-label setting for initialization.
TABLE 3. Comparison of training with and without migration data
Table 3 compares the current most advanced method with ours. Ours (W.O/T) shows the result of our method without using migration data in training; Ours (W/T) shows the result of our method using migration data in training.
The results in Table 3 show that even if only the data-style-migration data set plus the single-annotation data of the original data set is used during initialization, the results are higher than those of the current state-of-the-art methods. From our own two control experiments: on the two data sets, the rank-1 hit rate and mean average precision mAP improve by 5.6% and 3.8%, and by 1.2% and 0.7%, respectively, when migration data is used compared with not using it.
After initialization with the single-labeling method, the improvement from training with migration data is even more pronounced. There are two main reasons for this good result. First, camera style migration provides K times the original single-label data volume at initialization, yielding a good initial CNN model that benefits the subsequent iterative optimization, since a good initialization shows stronger performance when exploiting unlabeled data. Second, after the migration data is put into training, every pedestrian has trainable pictures under all the different camera styles, which to some extent resolves the overfitting caused by a single camera scene and a small, unrepresentative amount of training data, and greatly improves the training effect.
FIGS. 4(a)-4(b) show the DukeMTMC-reID model initialized with our single-annotation scheme. Ours (W.O/T) indicates that no camera style migration data was added during training; Ours (W/T) indicates that migration data was added. The abscissa is the percentage of selected data relative to all unlabeled data, and each solid point represents one iteration result.
First, from the control experiments of FIGS. 4(a)-4(b) we can clearly conclude that introducing migration data during training yields a more robust model than training with the original data set alone. Second, as a higher and higher proportion of unlabeled data is added to training, the rank-1 hit rate and mean average precision mAP increase to a certain point and then grow slowly or even decline. This is because, as unlabeled data is gradually added during iterative training, fewer and fewer reliable pseudo-labeled samples remain available for selection in the later iterations, so more noise is introduced into training, which can damage the robustness of the model to some degree.
To further consolidate our results, we list the more detailed data in fig. 5(a) -5 (d).
Precision and Recall denote the prediction accuracy and recall of the selected pseudo-label candidate set. FIGS. 5(a)-5(b) show the experimental results on the Market-1501 data set, and FIGS. 5(c)-5(d) show the results on the DukeMTMC-reID data set.
Compared with the same single-labeling experiments, our method performs much better than the others even without training on migration data. Compared with the method of Wu Y., Lin Y., Dong X., et al., "Progressive Learning for Person Re-Identification with One Example," IEEE Transactions on Image Processing, 2019, our method reaches a precision of about 89.33%, well above that method's roughly 70%, and reaches 45.6% recall at the last iteration, also higher than any other single-labeled method. When migration data is used for training, almost all results surpass those on the original data set, showing that style migration data can effectively alleviate overfitting to training pictures from a single camera and, to a certain degree, the difficulty of recognizing a pedestrian under different cameras.
Regarding the effect of the enlargement factor p on the experiment:
TABLE 4. Effect of the enlargement factor p
It can be seen from Table 4 that the smaller the enlargement factor p, the better the experimental results. A smaller enlargement factor greatly improves the stability of the model during iterative training, but requires more time and more iteration steps.
In single-label pedestrian re-identification research, only one picture per pedestrian is labeled, so the initialized model has low performance. Camera style migration changes the data volume to K times that of the original data set, and applying the single-label setting to the expanded data likewise makes the labeled samples used for initialization K times the original. Such an initialization model is more robust. Using migration data in training gives better results than using the original data set alone and reduces the risk of overfitting to pictures shot by a single camera. The experimental results prove the effectiveness of the method: augmenting the data set through camera style migration inside the data set lets the single-label experiment and the self-paced learning scheme obtain better performance.
The present disclosure is primarily directed at single-sample pedestrian re-identification. Single-sample annotation means that each pedestrian in the data set has only one annotated sample and many unlabeled samples. The specific method is to train a CNN model with the small amount of labeled data, use the trained model to predict labels (pseudo-labels) for the unlabeled data, and finally retrain the model with the predicted pseudo-label data together with the original small amount of labeled data. However, because the angles, color rendition, backgrounds and so on differ across cameras, pictures of the same pedestrian differ greatly. With only one labeled sample, the pictures lack cross-camera learning and recognition efficiency is low. A new single-sample labeling scheme is therefore proposed: the image styles of different cameras are interconverted through CycleGAN, so that every pedestrian has at least one labeled image in each camera style. Giving each pedestrian a labeled picture under all the different camera styles effectively solves the problem that a single labeled sample is unrepresentative, and a self-paced learning framework then realizes an efficient iterative process. Compared with the latest technology: on the Market-1501 data set, the rank-1 hit rate improves from 55.8% to 64.5%, the rank-5, rank-10 and rank-20 hit rates improve by 6.2%, 5.2% and 4.4% respectively, and the mean average precision mAP improves from 26.2% to 31.4%; on the DukeMTMC-reID data set, our method likewise improves these evaluation metrics by 6.3%, 4.1%, 3.4%, 2.8% and 1.5%, respectively. The experiments prove that interconverting pictures of different styles under the different cameras of one data set effectively expands the data set while solving the low recognition rate caused by the lack of labeled cross-camera pedestrian pictures.
The present disclosure focuses on single-sample learning, which combines the advantages of both settings: it can fully utilize the data distribution information of the unlabeled samples together with the category labels of the labeled samples, automatically exploiting the unlabeled samples to improve learning performance.
The present disclosure proposes cross-camera similarity mining during model training and the search for trusted pictures. The specific method: style conversion between the images of different cameras is performed within the same data set using CycleGAN, which not only increases the amount of labeled pictures but also expands the labeled training pictures from the original single camera style to multiple camera styles.
In the present disclosure, we use GAN network to realize data expansion and cross-camera labeling.
The second embodiment provides a pedestrian re-identification system based on camera style migration and single marking;
pedestrian re-identification system based on camera style migration and single marking includes:
an acquisition module configured to: acquiring a plurality of non-label images to be subjected to pedestrian re-identification;
a marking module configured to: marking the pedestrian to be identified in one of all the unlabelled images to be subjected to pedestrian re-identification;
a data amplification module configured to: inputting the marked images into a pre-trained cycleGAN network, and outputting a plurality of camera style migration images corresponding to each image in the marked images to realize data amplification; marking the pedestrian of the camera style migration image corresponding to the marked image;
the unlabelled images are sent to a pre-trained cycleGAN network, and a plurality of camera style migration images corresponding to each image in the unlabelled images are output to realize data amplification;
an identification module configured to: and inputting the marked image, the unmarked image, the camera style migration image corresponding to the marked image and the camera style migration image corresponding to the unmarked image into a pre-trained CNN network, and outputting the identification result of the pedestrian to be identified in each image in the unmarked image.
In a third embodiment, the present embodiment further provides an electronic device, which includes a memory, a processor, and computer instructions stored in the memory and executed on the processor, where the computer instructions, when executed by the processor, implement the method in the first embodiment.
In a fourth embodiment, the present embodiment further provides a computer-readable storage medium for storing computer instructions, and the computer instructions, when executed by a processor, implement the method of the first embodiment.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. The pedestrian re-identification method based on camera style migration and single marking is characterized by comprising the following steps of:
acquiring a plurality of non-label images to be subjected to pedestrian re-identification;
marking the pedestrian to be identified in one of all the unlabelled images to be subjected to pedestrian re-identification;
inputting the marked images into a pre-trained cycleGAN network, and outputting a plurality of camera style migration images corresponding to each image in the marked images to realize data amplification; marking the pedestrian of the camera style migration image corresponding to the marked image;
the unlabelled images are sent to a pre-trained cycleGAN network, and a plurality of camera style migration images corresponding to each image in the unlabelled images are output to realize data amplification;
and inputting the marked image, the unmarked image, the camera style migration image corresponding to the marked image and the camera style migration image corresponding to the unmarked image into a pre-trained CNN network, and outputting the identification result of the pedestrian to be identified in each image in the unmarked image.
2. The method of claim 1, wherein the pre-trained cycleGAN network, the step of training comprising:
s31: constructing a cycleGAN network;
s32: constructing a training set;
s33: and taking the image containing the pedestrian b acquired by one camera a in the training set as an input value of the cycleGAN network, taking the images containing the pedestrian b acquired by all the cameras except the camera a in the training set as output values of the cycleGAN network, and training the cycleGAN network to obtain the trained cycleGAN network.
3. The method of claim 2, wherein the training set comprises the Market-1501 data set or the DukeMTMC-reID data set.
4. The method of claim 1, wherein training the pre-trained CNN network comprises:
S41: performing primary training of the CNN network on the ImageNet data set, with each image as an input value of the CNN network and its label as an output value of the CNN network;
S42: performing secondary training of the primarily trained CNN network on a real-label data set, with each image in the real-label data set as an input value of the primarily trained CNN network and its pedestrian label as an output value of the primarily trained CNN network;
S43: performing tertiary training of the secondarily trained CNN network using the plurality of unlabeled images to be subjected to pedestrian re-identification: all unlabeled images are fed to the secondarily trained CNN network as input values, and the network outputs a set of pseudo labels; the reliability of each pseudo label is then judged, pseudo labels whose reliability is above a set threshold are retained, and pseudo labels whose reliability is below the threshold are removed;
S44: inputting the real-label data set and the data set carrying the high-reliability pseudo labels into the tertiarily trained CNN network for final training, thereby obtaining the trained CNN network.
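(Editorial note, for orientation only; not part of the claim.) Purely as a reading aid, the four-stage schedule might be wired up as follows; train(), predict_pseudo_labels(), the data loaders, and the threshold are hypothetical placeholders, and only the staging itself mirrors S41-S44.

    # Hedged sketch of the S41-S44 schedule (PyTorch / torchvision).
    import torchvision

    # S41: primary training -- in practice, load ImageNet-pretrained weights.
    cnn = torchvision.models.resnet50(weights="IMAGENET1K_V1")

    # S42: secondary training on the (optionally CycleGAN-augmented)
    # real-label set; train() is a hypothetical helper.
    train(cnn, real_label_loader)

    # S43: tertiary pass -- pseudo-label the unlabeled images and keep only
    # those whose reliability clears the threshold (see claim 7).
    reliable = [(image, label)
                for image, label, reliability
                in predict_pseudo_labels(cnn, unlabeled_loader)
                if reliability > threshold]

    # S44: final training on real labels plus the reliable pseudo labels.
    train(cnn, real_label_loader, extra_samples=reliable)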
5. The method of claim 4, wherein before the secondary training of the primarily trained CNN network on the real-label data set in S42, the method further comprises:
inputting the real-label data set into the trained CycleGAN network for data augmentation.
6. The method of claim 4, wherein before the tertiary training of the secondarily trained CNN network using the plurality of unlabeled images to be subjected to pedestrian re-identification in S43, the method further comprises:
inputting the plurality of unlabeled images to be subjected to pedestrian re-identification into the trained CycleGAN network for data augmentation.
7. The method of claim 4, wherein in S43 the reliability of a pseudo label is judged as follows:
calculating the distance between the pedestrian image feature corresponding to the pseudo label and the pedestrian image feature corresponding to the real label of the same pedestrian; if the distance is smaller than a set threshold, the current pseudo label is a reliable pseudo label; otherwise, the current pseudo label is an unreliable pseudo label.
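(Editorial note, for orientation only; not part of the claim.) This test reduces to a distance threshold in feature space; the Euclidean metric and the free threshold parameter below are assumptions of the sketch, since the claim fixes neither.

    # Hedged sketch of the claim-7 reliability test (PyTorch).
    import torch

    def is_reliable(pseudo_feature: torch.Tensor,
                    real_feature: torch.Tensor,
                    threshold: float) -> bool:
        # Keep the pseudo label only if its image feature lies within
        # `threshold` of the feature of the real-labeled image of the
        # same pedestrian (Euclidean distance assumed).
        return torch.dist(pseudo_feature, real_feature).item() < threshold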
8. A pedestrian re-identification system based on camera style migration and single marking, characterized by comprising:
an acquisition module configured to acquire a plurality of unlabeled images to be subjected to pedestrian re-identification;
a marking module configured to mark the pedestrian to be identified in one of the unlabeled images to be subjected to pedestrian re-identification;
a data augmentation module configured to: input the marked image into a pre-trained CycleGAN network and output a plurality of camera style migration images corresponding to the marked image, thereby augmenting the data; mark the pedestrian in each camera style migration image corresponding to the marked image; and input the unlabeled images into the pre-trained CycleGAN network and output a plurality of camera style migration images corresponding to each unlabeled image, thereby augmenting the data; and
an identification module configured to input the marked image, the unlabeled images, the camera style migration images corresponding to the marked image, and the camera style migration images corresponding to the unlabeled images into a pre-trained CNN network, and output the identification result of the pedestrian to be identified in each unlabeled image.
9. An electronic device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1-7.
CN202010053330.2A 2020-01-17 2020-01-17 Pedestrian re-identification method and system based on camera style migration and single marking Pending CN111242064A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010053330.2A CN111242064A (en) 2020-01-17 2020-01-17 Pedestrian re-identification method and system based on camera style migration and single marking

Publications (1)

Publication Number Publication Date
CN111242064A 2020-06-05

Family

ID=70868708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010053330.2A Pending CN111242064A (en) 2020-01-17 2020-01-17 Pedestrian re-identification method and system based on camera style migration and single marking

Country Status (1)

Country Link
CN (1) CN111242064A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008842A * 2019-03-09 2019-07-12 Tongji University Pedestrian re-identification method based on a deep multi-loss fusion model

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
H. Fan et al., "Unsupervised person re-identification: Clustering and fine-tuning", ACM *
Jun-Yan Zhu et al., "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks", 2017 IEEE International Conference on Computer Vision *
Y. Lin et al., "Improving person re-identification by attribute and identity learning", arXiv *
Yu Wu et al., "Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning", IEEE *
Yu Wu et al., "Progressive Learning for Person Re-Identification With One Example", IEEE Transactions on Image Processing *
Z. Zhong et al., "Re-ranking person re-identification with k-reciprocal encoding", IEEE *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069929A * 2020-08-20 2020-12-11 Zhejiang Lab Unsupervised pedestrian re-identification method and device, electronic equipment and storage medium
CN112069929B * 2020-08-20 2024-01-05 Zhejiang Lab Unsupervised pedestrian re-identification method and device, electronic equipment and storage medium
CN112131961B * 2020-08-28 2023-02-03 Ocean University of China Semi-supervised pedestrian re-identification method based on a single sample
CN112131961A * 2020-08-28 2020-12-25 Ocean University of China Semi-supervised pedestrian re-identification method based on a single sample
CN112528788A * 2020-12-01 2021-03-19 Chongqing Zhaoguang Technology Co., Ltd. Re-identification method based on domain-invariant features and spatio-temporal features
CN112528788B * 2020-12-01 2023-11-21 Chongqing Zhaoguang Technology Co., Ltd. Re-identification method based on domain-invariant features and spatio-temporal features
CN112507941A * 2020-12-17 2021-03-16 China University of Mining and Technology Cross-view pedestrian re-identification method and device for mine AI video analysis
CN112507941B * 2020-12-17 2024-05-10 China University of Mining and Technology Cross-view pedestrian re-identification method and device for mine AI video analysis
CN113222114A * 2021-04-22 2021-08-06 University of Science and Technology Beijing Image data augmentation method and device
CN113222114B * 2021-04-22 2023-08-15 University of Science and Technology Beijing Image data augmentation method and device
CN113449850A * 2021-07-05 2021-09-28 University of Electronic Science and Technology of China Intelligent suppression method for sea-surface surveillance radar clutter
CN113256778A * 2021-07-05 2021-08-13 Aibao Technology Co., Ltd. Method, device, medium and server for generating vehicle appearance part identification samples
CN114299543A * 2021-12-29 2022-04-08 Fuzhou University Unsupervised pedestrian re-identification method
CN115862087A * 2022-09-26 2023-03-28 Harbin Institute of Technology Unsupervised pedestrian re-identification method and system based on reliability modeling
CN115862087B * 2022-09-26 2023-06-23 Harbin Institute of Technology Unsupervised pedestrian re-identification method and system based on reliability modeling
CN115909464A * 2022-12-26 2023-04-04 Huaiyin Institute of Technology Self-adaptive weakly supervised label marking method for pedestrian re-identification
CN115909464B * 2022-12-26 2024-03-26 Huaiyin Institute of Technology Self-adaptive weakly supervised label marking method for pedestrian re-identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200605)