CN113936246A

CN113936246A - Unsupervised target person re-identification method based on discriminative learning of joint local features

Info

Publication number: CN113936246A
Application number: CN202111076953.2A
Authority: CN
Inventors: 田月媛; 付苗苗; 邓苗磊; 张德贤; 吴雨露
Original assignee: Henan University of Technology
Current assignee: Henan University of Technology
Priority date: 2021-09-14
Filing date: 2021-09-14
Publication date: 2022-01-14

Abstract

The invention discloses an unsupervised target pedestrian re-identification method based on joint local feature discriminant learning. A Joint Local Feature Extraction Network (JLFEN), formed by parallel spatial transformers and several simple convolutional neural networks, horizontally and dynamically divides the same pedestrian into two groups of local regions at different scales and extracts effective features from them, so that the local features are effectively aligned in space. A Feature Joint Discrimination (FJD) loss function, consisting of a Local Feature Discrimination (LFD) loss function and a Cascade Feature Discrimination (CFD) loss function, is used to improve the model by performing discriminant learning on unsupervised local features, which reduces the influence of different pedestrians with similar appearance on local feature learning.

Description

Unsupervised target pedestrian re-identification method based on joint local feature discriminant learning

Technical Field

The invention relates to an unsupervised target pedestrian re-identification method based on joint local feature discriminant learning, and belongs to the field of computer vision.

Background

In deep-learning-based pedestrian re-identification research, the quality of local features also influences how well a re-identification model learns features from unlabeled data. To learn more effective local features, an unsupervised target pedestrian re-identification method based on joint local feature discriminant learning is proposed. The joint local feature extraction network divides pedestrian features horizontally and dynamically and extracts the corresponding region features to obtain two local feature groups, which reduces the influence of pedestrian pose changes and camera angle on local feature alignment. A local feature discrimination loss function guides discriminant learning of the local features, improving the ability of the unsupervised pedestrian re-identification model to learn local features. To reduce the influence on model learning of local features that look similar across different pedestrians, a cascade feature discrimination loss function computes the relative and absolute distances between features, pulling similar features together and pushing different features apart, which further enhances the recognition performance of the unsupervised pedestrian re-identification model.

Disclosure of Invention

In order to solve some problems in the background art, the invention provides an unsupervised target pedestrian re-identification method based on joint local feature discriminant learning.

1. An unsupervised target pedestrian re-identification method based on joint local feature discriminant learning, characterized by comprising the following step: a joint local feature extraction network;

in order to enable the network to extract local features of different regions according to changes in the pose and angle of the pedestrian in the image, and to mitigate the problem that local features cannot be well aligned, two parallel spatial transformers are added after the ResNet50 network to divide the local regions, and simple convolutional neural networks are used for feature extraction; the intermediate feature map of the image is fed into several localization networks, and applying the spatial transformation to the feature map rather than to the original image reduces the computational complexity of the network; each localization network consists of one convolutional layer with a 3 × 3 kernel and two fully connected layers, uses ReLU as the activation function, and has the bias of its last fully connected layer initialized; to obtain local sampling grids for the two division schemes, the localization networks predict two groups of spatial position parameters, θ = {θ₁, θ₂, ..., θ_M} and η = {η₁, η₂, ..., η_N};

So that the predicted spatial positions yield effective fine-grained features and the features are aligned in space, the position parameters in the two spatial transformers are predicted according to the parts of the human body in the vertical direction; the human body can be divided into three parts, namely the head, the upper body and the lower body; in general the upper body is shorter than the lower body, so the head takes the smallest share, followed by the upper body and then the lower body; under a camera, however, changes in camera angle and pedestrian pose change the proportions of the pedestrian in the captured image, and if, for example, the upper body appears long and the lower body short, the local regions divided for the same pedestrian can no longer be aligned;

the localization network first predicts three groups of spatial position parameters to divide the pedestrian horizontally in unequal proportions, yielding three local regions whose heights are in a ratio close to 1:2:3; local features are obtained from top to bottom, so that the head region takes the smallest share, the middle region is close to the size of the upper body at a normal viewing angle, and the last region contains the feature information below the pedestrian's hips; in this way different local regions can be divided for the pedestrian in image space according to how the pedestrian actually varies in the image, which addresses the problem that local features cannot be aligned because of changes in the pedestrian's proportions, pose and the like;

in addition, the upper body and the lower body of the pedestrian contain more detailed pedestrian information and can be further divided into parts such as the chest, abdomen, thighs, calves and feet; different parts may contain different feature information, and the feature information obtained from a finer division into local regions allows the model to better extract fine-grained features from the image; therefore, taking into account the multiple parts that make up the human body, six groups of spatial position parameters are predicted to divide the image into six local regions, which makes it easier for the network to mine more effective fine-grained feature information from the pedestrian's local regions; combining the two kinds of local feature information improves the robustness and recognition accuracy of the model; the position parameters take the form of a 2 × 3 affine transformation, and the local regions are obtained by cropping the feature map;

wherein A_θ and A_η respectively represent the unknown parameters in the two groups of localization networks, and the image is locally cropped using the predicted parameters a, b, c and d;

according to the predicted parameters, the feature map is cropped so that local sampling grids with different positions and scales are divided according to the spatial position of the pedestrian in the image; the generation process is

Wherein, for each spatial location parameterized by the positioning network,

representing the spatial location coordinates of the input,

representing the spatial location coordinates of the output; finally, two groups of local sampling grid parameters for the two division schemes are obtained and a sampler is used for sampling, so that, according to the above formula, three local regions with a height ratio close to 1:2:3 and six local regions are obtained from the same image; because different local regions contain feature information of different parts of the pedestrian, the obtained local regions are each fed into a simple convolutional neural network and encoded to obtain local features; this convolutional neural network consists of an adaptive average pooling function, a convolutional layer, a BN layer and two fully connected layers, where the adaptive average pooling function ensures that the feature region input to the convolutional layer is a local feature map of fixed size 2048 × 1; feature extraction is then performed on each local region through the convolutional layer, the BN layer and the two fully connected layers; in addition, concatenating the local features yields a concatenated local feature of the image, so that the model captures the important overall information of the pedestrian in the image and the problem of inaccurate matching caused by similar local features of different pedestrians is reduced.
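The region division and sampling grid described above can be illustrated with a minimal sketch. It assumes the standard spatial transformer formulation, in which each output (target) coordinate in normalized [-1, 1] space is mapped to an input (source) coordinate by a 2 × 3 affine matrix. In the method the matrices are predicted by the localization networks; here they are derived from a fixed nominal 1:2:3 split purely for illustration. The function names `stripe_bounds`, `stripe_to_theta` and `affine_grid`, the feature-map height of 24, and the scale-and-shift form of the matrix are all assumptions that do not appear in the source.

```python
def stripe_bounds(height, ratios=(1, 2, 3)):
    """Split `height` rows into consecutive stripes proportional to `ratios`."""
    total = sum(ratios)
    bounds, start, cumulative = [], 0, 0
    for r in ratios:
        cumulative += r
        end = round(height * cumulative / total)
        bounds.append((start, end))
        start = end
    return bounds


def stripe_to_theta(start, end, height):
    """Hypothetical 2x3 affine matrix: full width, rows [start, end) only.

    Rows are mapped to normalized coordinates in [-1, 1], top = -1.
    """
    scale_y = (end - start) / height
    shift_y = (start + end) / height - 1.0
    return [[1.0, 0.0, 0.0], [0.0, scale_y, shift_y]]


def _linspace(n):
    """n evenly spaced normalized coordinates in [-1, 1]."""
    if n == 1:
        return [0.0]
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]


def affine_grid(theta, out_h, out_w):
    """Return an out_h x out_w grid of (x_s, y_s) source coordinates.

    Each target point (x_t, y_t, 1) is multiplied by the 2x3 matrix theta.
    A sampler (for example bilinear interpolation) would then read the
    feature map at these source coordinates; that step is omitted here.
    """
    grid = []
    for y_t in _linspace(out_h):
        row = []
        for x_t in _linspace(out_w):
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid


# Nominal 1:2:3 split of a 24-row feature map, then the grid for the lowest stripe.
bounds = stripe_bounds(24)
theta_lower = stripe_to_theta(*bounds[2], 24)
grid = affine_grid(theta_lower, out_h=3, out_w=2)
```

Calling `stripe_bounds(24, (1,) * 6)` gives six equal stripes, which corresponds to the second, finer division scheme only in its nominal layout; the learned parameters would shift and rescale these regions per image.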

2. An unsupervised target pedestrian re-identification method based on joint local feature discriminant learning, characterized by comprising the following step: local feature discriminant learning;

through the two local region divisions in the joint local feature extraction network, local features at different positions and scales can be extracted from the same image; these local features must then be compared among unlabeled local features, which is very difficult to handle in deep learning based on mini-batch optimization, so a feature memory is adopted to store and update the unsupervised sample features; according to the Euclidean distance metric, each local feature of the input image

is compared by similarity with the features at similar positions in other images for local feature learning,

representing the m-th local feature of the i-th image; the feature memory

is updated by using the similarity of sample features as supervisory information to assist clustering: the sample features are trained to find their nearest similar features, and the features are updated accordingly depending on whether the pseudo-label classes are consistent; the dynamic update process is:

wherein

Is composed of

The rate of the update is 0.1,

for the updated latest local feature, P is the training period, and when P is 0,

initializing a feature library for an unlabeled database prior to training, and updating features in memory

Infinite proximity to

We compute each local feature

Finding distances from Euclidean distances between sample features in feature memory

The nearest K local features are obtained

Set and then calculate the K local features and features

The sum of the similarities between them, and

and all mth local features in the feature memory to obtain a Local Feature Discrimination (LFD) loss function as:

wherein M represents the number of local features into which the image is divided, here 3 and 6, and ‖·‖₂ represents the Euclidean distance.
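The feature-memory steps described in this section can be sketched roughly as follows. The update and loss formulas are not reproduced in the source text, so the sketch only covers what the prose states: finding the K nearest memory entries by Euclidean distance, and moving a stored feature toward its latest value with an update rate of 0.1. The moving-average form of the update, the function names and the toy two-dimensional features are assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def k_nearest(memory, query, k):
    """Indices of the k memory entries closest to `query`."""
    order = sorted(range(len(memory)), key=lambda j: euclidean(memory[j], query))
    return order[:k]


def update_memory(memory, index, feature, rate=0.1):
    """Assumed moving-average update: the stored feature drifts toward `feature`."""
    memory[index] = [
        (1.0 - rate) * old + rate * new
        for old, new in zip(memory[index], feature)
    ]


# Toy memory holding one local feature per unlabeled image.
memory = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
neighbours = k_nearest(memory, [0.9, 0.1], k=2)
update_memory(memory, 1, [2.0, 0.0])
```

In a real training loop a separate memory would be kept for each of the M local positions, and the slot belonging to the query image itself would normally be excluded from the neighbour search.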

3. An unsupervised target pedestrian re-identification method based on joint local feature discriminant learning, characterized by comprising the following step: cascade feature discriminant learning;

when the data have no class labels and local features that look similar but carry different identity information are clustered, the extracted local features of the same pedestrian are easily pulled apart while the local features of different pedestrians are easily drawn together, so that the local features of the same pedestrian cannot be matched and the model's ability to learn local features is reduced; therefore, to make the sample features learned by the model more robust, a Cascade Feature Discrimination (CFD) loss function is adopted to optimize the model; the local features output for an unlabeled image are concatenated to obtain the sum of all local feature information of the image, and the discriminability of the local cascade features is learned by maximizing the inter-class distance and minimizing the intra-class distance; learning the features of a sample from its hardest positive sample and hardest negative samples strengthens the feature learning ability of the model and improves its robustness and accuracy, so a corresponding quadruplet loss function is proposed to guide the learning of the cascade features; because the cascade feature discrimination loss function uses the hardest positive sample and the hardest negative sample pair, these must be found for each sample in different ways;

first, a small sample batch is given

For each input image X_i, a series of simple random transformations, including image cropping and changes of contrast, saturation and brightness, is applied to obtain a pseudo-positive sample X_pi, which is assigned the same identity label as the input image and used as its hardest positive sample; all image samples are fed into the network, and the randomly generated pseudo-positive samples benefit the feature discriminant learning of the unsupervised model; next, samples that are not nearest neighbors are considered to have dissimilar identities and not to belong to the same class; whether two samples are nearest neighbors can be determined from the similarity between them, and the hardest negative sample pair is determined from the ranked similarity results: the pairwise Euclidean distances between samples are measured, and for each sample X_i the results are sorted to generate a ranking list N_i; the farther a sample X_j is from X_i, the lower its similarity to X_i, so if X_j is not among the top-n nearest neighbors of X_i, X_j can be identified as a negative sample of X_i; to mine the hardest negative pair, the first two negative samples in the ranking list, x_mi and x_ni, are selected as the hardest negative sample pair, where x_mi is ranked before x_ni; finally, the model performs feature learning on the four samples obtained, and its Cascade Feature Discrimination (CFD) loss function is expressed as:

$$L_{CFD} = \left(\|x_{ai} - x_{pi}\|_2 - \|x_{ai} - x_{mi}\|_2 + \alpha\right)_+ + \left(\|x_{ai} - x_{pi}\|_2 - \|x_{mi} - x_{ni}\|_2 + \beta\right)_+$$

wherein x_ai represents the input image, x_pi represents the pseudo-positive sample, x_mi and x_ni represent the hardest negative sample pair, (·)₊ denotes max(·, 0), and the parameters α and β are thresholds.
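The negative mining and the CFD loss described above can be sketched as follows, treating the input image as the anchor in every term of the loss. This is an illustrative reading under stated assumptions rather than the authors' implementation: the function names, the one-dimensional toy features and the threshold values passed for α and β are hypothetical, since the source gives no values for the thresholds.

```python
import math


def dist(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hardest_negative_pair(features, i, n):
    """Rank all other samples by distance to sample i.

    The first n entries of the ranking are treated as nearest neighbours;
    the next two entries are returned as the hardest negative pair (m, n').
    """
    ranked = sorted(
        (j for j in range(len(features)) if j != i),
        key=lambda j: dist(features[i], features[j]),
    )
    negatives = ranked[n:]
    if len(negatives) < 2:
        raise ValueError("need at least two samples outside the top-n neighbours")
    return negatives[0], negatives[1]


def cfd_loss(x_a, x_p, x_m, x_n, alpha, beta):
    """Quadruplet-style loss with two hinge terms, following the formula above."""
    d_ap = dist(x_a, x_p)
    relative = max(d_ap - dist(x_a, x_m) + alpha, 0.0)
    absolute = max(d_ap - dist(x_m, x_n) + beta, 0.0)
    return relative + absolute


# Toy batch of cascade features; sample 0 is the anchor.
batch = [[0.0], [1.0], [2.0], [4.0], [7.0]]
m, n2 = hardest_negative_pair(batch, i=0, n=1)
pseudo_positive = [0.5]  # stands in for the feature of the augmented anchor image
loss = cfd_loss(batch[0], pseudo_positive, batch[m], batch[n2], alpha=2.0, beta=0.5)
```

With n = 1 the closest sample is treated as a neighbour and skipped, so the next two samples in the ranking become the negative pair; choosing n too small risks treating a true positive as a negative, which matches the sensitivity to n that the description reports.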

4. An unsupervised target pedestrian re-identification method based on joint local feature discriminant learning, characterized by comprising the following step: feature joint discriminant learning;

the local feature discrimination loss function mainly performs discriminant learning on each individual local feature of an image, whereas the cascade feature discrimination loss function mainly learns the discriminability of all local features of each image taken together; combining the two loss functions in a reasonable way gives the feature joint discrimination loss function, which improves the ability of the pedestrian re-identification model to learn features from unlabeled data; the Feature Joint Discrimination (FJD) loss function is expressed as:

where λ represents a weight.

Drawings

FIG. 1 shows the experimental results for parameter K on different data sets.

FIG. 2 shows the experimental results for parameter n on different data sets.

FIG. 3 shows the results for different values of parameter λ on the Market-1501 data set.

FIG. 4 shows the results for different values of parameter λ on the DukeMTMC-reID data set.

FIG. 5 shows the structure of the model used by the method.

Detailed Description

The invention comprises the following technical scheme:

in order to enable the network to extract local features of different regions according to changes in the pose and angle of the pedestrian in the image, and to mitigate the problem that local features cannot be well aligned, two parallel spatial transformers are added after the ResNet50 network to divide the local regions, and simple convolutional neural networks are used for feature extraction; the intermediate feature map of the image is fed into several localization networks, and applying the spatial transformation to the feature map rather than to the original image reduces the computational complexity of the network; each localization network consists of one convolutional layer with a 3 × 3 kernel and two fully connected layers, uses ReLU as the activation function, and has the bias of its last fully connected layer initialized; to obtain local sampling grids for the two division schemes, the localization networks predict two groups of spatial position parameters, θ = {θ₁, θ₂, ..., θ_M} and η = {η₁, η₂, ..., η_N};

in addition, the upper body and the lower body of the pedestrian contain more detailed pedestrian information and can be further divided into parts such as the chest, abdomen, thighs, calves and feet; different parts may contain different feature information, and the feature information obtained from a finer division into local regions allows the model to better extract fine-grained features from the image; therefore, taking into account the multiple parts that make up the human body, six groups of spatial position parameters are predicted to divide the image into six local regions, which makes it easier for the network to mine more effective fine-grained feature information from the pedestrian's local regions; combining the two kinds of local feature information improves the robustness and recognition accuracy of the model; the position parameters take the form of a 2 × 3 affine transformation, and the local regions are obtained by cropping the feature map;

Wherein, for each spatial location parameterized by the positioning network,

representing the spatial location coordinates of the input,

representing the spatial location coordinates of the output; finally, two groups of local sampling grid parameters for the two division schemes are obtained and a sampler is used for sampling, so that three local regions with a height ratio close to 1:2:3 and six local regions are obtained from the same image; because different local regions contain feature information of different parts of the pedestrian, the obtained local regions are each fed into a simple convolutional neural network and encoded to obtain local features; this convolutional neural network consists of an adaptive average pooling function, a convolutional layer, a BN layer and two fully connected layers, where the adaptive average pooling function ensures that the feature region input to the convolutional layer is a local feature map of fixed size 2048 × 1; feature extraction is then performed on each local region through the convolutional layer, the BN layer and the two fully connected layers; in addition, concatenating the local features yields a concatenated local feature of the image, so that the model captures the important overall information of the pedestrian in the image and the problem of inaccurate matching caused by similar local features of different pedestrians is reduced.

representing the m-th local feature of the i-th image; the feature memory

is updated by using the similarity of sample features as supervisory information to assist clustering: the sample features are trained to find their nearest similar features, and the features are updated accordingly depending on whether the pseudo-label classes are consistent; the dynamic update process is:

wherein

Is composed of

The rate of the update is 0.1,

Infinite proximity to

We compute each local feature

The nearest K local features are obtained

Set and then calculate the K local features and features

The sum of the similarities between them, and

first, a small sample batch is given

For each input image X_i, a series of simple random transformations, including image cropping and changes of contrast, saturation and brightness, is applied to obtain a pseudo-positive sample X_pi, which is assigned the same identity label as the input image and used as its hardest positive sample; all image samples are fed into the network, and the randomly generated pseudo-positive samples benefit the feature discriminant learning of the unsupervised model; next, samples that are not nearest neighbors are considered to have dissimilar identities and not to belong to the same class; whether two samples are nearest neighbors can be determined from the similarity between them, and the hardest negative sample pair is determined from the ranked similarity results: the pairwise Euclidean distances between samples are measured, and for each sample X_i the results are sorted to generate a ranking list N_i; the farther a sample X_j is from X_i, the lower its similarity to X_i, so if X_j is not among the top-n nearest neighbors of X_i, X_j can be identified as a negative sample of X_i; to mine the hardest negative pair, the first two negative samples in the ranking list, x_mi and x_ni, are selected as the hardest negative sample pair, where x_mi is ranked before x_ni; finally, the model performs feature learning on the four samples obtained, and its Cascade Feature Discrimination (CFD) loss function is expressed as:

$$L_{CFD} = \left(\|x_{ai} - x_{pi}\|_2 - \|x_{ai} - x_{mi}\|_2 + \alpha\right)_+ + \left(\|x_{ai} - x_{pi}\|_2 - \|x_{mi} - x_{ni}\|_2 + \beta\right)_+$$

where λ represents a weight.

Results and analysis of the experiments

To verify the effect of the joint local feature extraction network on the alignment of pedestrian local features, it was evaluated on the DukeMTMC-reID and Market-1501 data sets, using the mean average precision (mAP) and the top-k matching rate as evaluation metrics. When the two local feature branches are combined, the mAP and top-1 values of the model are better than those of either single local feature branch. This shows that extracting local features of different scales and positions from the same pedestrian image lets the network capture effective local feature information of pedestrians at a finer level, which improves the accuracy of aligned local feature comparison and further improves the ability of the unsupervised pedestrian re-identification model to learn local features. The experimental results are shown in the following tables.

TABLE 1 Effect of the local feature branches on model performance on Market-1501

TABLE 2 Effect of the local feature branches on model performance on DukeMTMC-reID

Compared with the unsupervised local feature method PAUL, the proposed method uses joint local feature branches that are divided according to the structure of human body parts, so it obtains local features containing more effective feature information for local feature comparison. It also learns the local features with the FJD loss function, which has better discriminant learning capability. The tables show that on both data sets the mAP and top-1 results are higher than those of PAUL: on the Market-1501 data set the mAP value increases to 41.4% and the top-1 value to 70.2%; on the DukeMTMC-reID data set the mAP value increases to 54.1% and the top-1 value to 73.9%.

TABLE 3 Comparison with the latest methods on the Market-1501 data set

TABLE 4 Comparison with the latest methods on the DukeMTMC-reID data set

To better analyze how the Local Feature Discrimination (LFD) loss function and the Cascade Feature Discrimination (CFD) loss function affect the discriminability of the local features learned by the model, the result of the pre-trained JLFEN network is taken as the experimental baseline, and experiments are carried out on the typical data sets Market-1501 and DukeMTMC-reID. The results show that the mAP and the top-k matching rate improve markedly after learning with the LFD loss function, and that they also improve after using the CFD loss function, which indicates that the CFD loss function effectively discriminates between local features of different pedestrians with similar appearance. Finally, when the two loss functions are reasonably combined, the mAP and top-k results are clearly better than those of the other two configurations: on the Market-1501 data set the mAP value improves by 8.7%, the top-1 value by 8.3%, the top-5 value by 6.3% and the top-15 value by 5.5%; on the DukeMTMC-reID data set the mAP value improves by 7.2%, the top-1 value by 7.5%, the top-5 value by 2.8% and the top-15 value by 1.7%.

TABLE 5 Ablation results for the loss functions on the Market-1501 data set

TABLE 6 Ablation results for the loss functions on the DukeMTMC-reID data set

The values of the parameters in the loss functions also influence the performance of the model. Through experiments on different data sets, the influence on model performance of the weight parameter λ in the FJD loss function, of the parameter K in the LFD loss function and of the parameter n in the CFD loss function is analyzed. In each experiment the other parameters are held fixed while the values of one parameter are explored, and the first hit rate top-1 is used as the evaluation metric. It was found that the top-1 value is relatively good when K is 15.

The CFD loss function also plays an important role in the discriminability of the features learned by the model: finding better hardest negative samples effectively improves the learned features and enhances the robustness of the unsupervised model. The influence on the cascade feature discrimination loss function of the parameter n in the top-n nearest neighbors of a sample is therefore analyzed experimentally. Because CFD loss learning requires two negative samples, when n is too small the loss function mistakes positive samples for negative samples, which degrades the learning performance of the model. As n gradually increases, the performance of the model first rises and then falls: when n is large there are enough samples that the model finds the hardest samples too easily, which lowers the learning difficulty. The model achieves good performance when n is 6.

Based on the above results, the parameters are fixed at K = 15 and n = 6, and the parameter λ is analyzed experimentally. Fig. 3 shows that, on the different data sets, as λ gradually increases, the mean average precision mAP and the first hit rate top-1 first rise and then fall, and the model performs best at λ = 2.
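For readers unfamiliar with the two evaluation metrics used throughout these experiments, the following minimal sketch shows how mAP and the top-k matching rate can be computed from ranked retrieval results. It is a simplified illustration: each query is represented only by a list of booleans saying whether the gallery item at each rank shares the query's identity, and the usual data set protocol details (such as excluding same-camera matches) are not modeled. The function names and the toy rankings are assumptions.

```python
def average_precision(match_flags):
    """Average precision for one query.

    match_flags[r] is True if the gallery item at rank r + 1 has the
    same identity as the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(match_flags, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def top_k_hit(match_flags, k):
    """True if a correct match appears among the first k ranks."""
    return any(match_flags[:k])


def evaluate(all_flags, k=1):
    """Return (mAP, top-k matching rate) over all queries."""
    count = len(all_flags)
    m_ap = sum(average_precision(f) for f in all_flags) / count
    top_k = sum(top_k_hit(f, k) for f in all_flags) / count
    return m_ap, top_k


# Two toy queries with three ranked gallery items each.
rankings = [[True, False, True], [False, True, False]]
m_ap, top1 = evaluate(rankings, k=1)
```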

Claims

in order to enable the network to extract local features of different regions according to changes in the pose and angle of the pedestrian in the image, and to mitigate the problem that local features cannot be well aligned, two parallel spatial transformers are added after the ResNet50 network to divide the local regions, and simple convolutional neural networks are used for feature extraction; the intermediate feature map of the image is fed into several localization networks, and applying the spatial transformation to the feature map rather than to the original image reduces the computational complexity of the network; each localization network consists of one convolutional layer with a 3 × 3 kernel and two fully connected layers, uses ReLU as the activation function, and has the bias of its last fully connected layer initialized; to obtain local sampling grids for the two division schemes, the localization networks predict two groups of spatial position parameters, θ = {θ₁, θ₂, ..., θ_M} and η = {η₁, η₂, ..., η_N};

So that the predicted spatial positions yield effective fine-grained features and the features are aligned in space, the position parameters in the two spatial transformers are predicted according to the parts of the human body in the vertical direction; the human body can be divided into three parts, namely the head, the upper body and the lower body; in general the upper body is shorter than the lower body, so the head takes the smallest share, followed by the upper body and then the lower body; under a camera, however, changes in camera angle and pedestrian pose change the proportions of the pedestrian in the captured image, and if, for example, the upper body appears long and the lower body short, the local regions divided for the same pedestrian can no longer be aligned;

the localization network first predicts three groups of spatial position parameters to divide the pedestrian horizontally in unequal proportions, yielding three local regions whose heights are in a ratio close to 1:2:3; local features are obtained from top to bottom, so that the head region takes the smallest share, the middle region is close to the size of the upper body at a normal viewing angle, and the last region contains the feature information below the pedestrian's hips;

in addition, the upper body and the lower body of the pedestrian contain more detailed pedestrian information and can be further divided into parts such as the chest, abdomen, thighs, calves and feet; different parts may contain different feature information, and the feature information obtained from a finer division into local regions allows the model to better extract fine-grained features from the image; therefore, taking into account the multiple parts that make up the human body, six groups of spatial position parameters are predicted to divide the image into six local regions, which makes it easier for the network to mine more effective fine-grained feature information from the pedestrian's local regions; combining the two kinds of local feature information improves the robustness and recognition accuracy of the model; the position parameters take the form of a 2 × 3 affine transformation, and the local regions are obtained by cropping the feature map;

$$\begin{pmatrix} x_i^{s} \\ y_i^{s} \end{pmatrix} = A_\theta \begin{pmatrix} x_i^{t} \\ y_i^{t} \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^{t} \\ y_i^{t} \\ 1 \end{pmatrix}$$

wherein, in the standard spatial transformer notation, $A_\theta$ is the 2 × 3 affine matrix formed by the spatial position parameters $\theta$ predicted by the positioning network, $(x_i^{s}, y_i^{s})$ represents the spatial position coordinates of the input, and $(x_i^{t}, y_i^{t})$ represents the spatial position coordinates of the output; finally, two groups of local sampling grid parameters corresponding to the two division modes are obtained and a sampler is used for sampling, so that three local regions with a height ratio close to 1:2:3 and six local regions are obtained from the same image; because different local regions contain feature information of different parts of the pedestrian, each local region is fed into a simple convolutional neural network and encoded to obtain a local feature; the convolutional neural network consists of an adaptive average pooling function, a convolutional layer, a BN layer and two fully connected layers, where the adaptive average pooling function ensures that the feature region input to the convolutional layer is a local feature map of the fixed size 2048 × 1, after which features are extracted from each local region by the convolutional layer, the BN layer and the two fully connected layers; meanwhile, the local features are concatenated to obtain the spliced local feature information of the image, so that the model obtains the important overall information of the pedestrian in the image, which addresses the problem of inaccurate matching caused by similar local features of different pedestrians.
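By way of a hedged example, the role of a 2 × 3 affine parameter group in selecting one horizontal region may be sketched as follows. The sketch assumes coordinates normalized to [-1, 1], as is common for spatial transformers, and assumes that the six regions are of equal height, which the text does not state. The function names `stripe_theta`, `apply_theta` and `thetas_from_ratios` are hypothetical and are not taken from the source; in the method the parameters are predicted by the positioning network rather than fixed.

```python
def stripe_theta(y0, y1):
    """2 x 3 affine matrix whose output grid covers input rows y0..y1.

    Coordinates are assumed to be normalized to [-1, 1]; the full width is kept.
    """
    scale = (y1 - y0) / 2.0
    shift = (y1 + y0) / 2.0
    return [[1.0, 0.0, 0.0],
            [0.0, scale, shift]]


def apply_theta(theta, xt, yt):
    """Map an output-grid coordinate (xt, yt) to an input coordinate (xs, ys)."""
    xs = theta[0][0] * xt + theta[0][1] * yt + theta[0][2]
    ys = theta[1][0] * xt + theta[1][1] * yt + theta[1][2]
    return xs, ys


def thetas_from_ratios(ratios):
    """One affine matrix per horizontal region, heights proportional to `ratios`."""
    total = float(sum(ratios))
    thetas = []
    accumulated = 0.0
    for ratio in ratios:
        y0 = -1.0 + 2.0 * accumulated / total
        accumulated += ratio
        y1 = -1.0 + 2.0 * accumulated / total
        thetas.append(stripe_theta(y0, y1))
    return thetas


three_thetas = thetas_from_ratios((1, 2, 3))   # nominal 1:2:3 division
six_thetas = thetas_from_ratios((1,) * 6)      # equal split: an assumption
```

Applying `apply_theta` to the top and bottom rows of the output grid (yt = -1 and yt = 1) returns the upper and lower edges of the corresponding input region, which is the cropping behaviour described above.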

2. An unsupervised target pedestrian re-identification method based on joint local feature discriminant learning is characterized by comprising the following steps of: local feature discrimination learning;

by combining the two local-region divisions in the joint local feature extraction network, local features at different positions and scales can be extracted from the same image; because comparing each local feature against all unlabeled local features is difficult to handle in deep learning based on mini-batch optimization, a feature memory is adopted to store and update the features of the unlabeled samples; the local features of each input image are discriminated according to the Euclidean distance metric, each local feature being indexed as the m-th local feature of the i-th image; each entry of the feature memory is formed from the corresponding local feature with an update rate of 0.1, so that the entry approaches that local feature arbitrarily closely; for each local feature, the K nearest local features in the memory are retrieved to form a set, and the sum of the similarities between these K local features and the feature is then calculated, from which the Local Feature Discrimination (LFD) loss function is computed;
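For illustration only, the feature memory and the K-nearest similarity sum may be sketched as follows. The formulas of this step are not reproduced in the present text, so the sketch rests on stated assumptions: the memory entry is updated as a moving average with the update rate 0.1 given above, and the similarity is taken as exp(-squared Euclidean distance). All function names are hypothetical.

```python
import math

UPDATE_RATE = 0.1  # update rate stated in the text


def update_memory(entry, feature, rate=UPDATE_RATE):
    """Moving-average update of one memory entry (assumed form)."""
    return [(1.0 - rate) * w + rate * x for w, x in zip(entry, feature)]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def k_nearest(feature, memory, k):
    """Indices of the k memory entries closest to `feature`."""
    order = sorted(range(len(memory)),
                   key=lambda j: squared_distance(feature, memory[j]))
    return order[:k]


def similarity_sum(feature, memory, k):
    """Sum of similarities between `feature` and its k nearest entries.

    The similarity exp(-squared distance) is an assumption of this sketch.
    """
    return sum(math.exp(-squared_distance(feature, memory[j]))
               for j in k_nearest(feature, memory, k))
```

Under the assumed moving-average update, repeated calls to `update_memory` with the same feature shrink the gap between the entry and the feature by a factor of 0.9 each time, which matches the statement that the entry approaches the feature arbitrarily closely.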

3. An unsupervised target pedestrian re-identification method based on joint local feature discriminant learning is characterized by comprising the following steps of: cascade feature discrimination learning;

when the data carry no class labels and local features that look similar but belong to different identities are clustered, the extracted local features of the same pedestrian tend to be pushed apart while those of different pedestrians tend to be drawn together, so that the local features of the same pedestrian cannot be matched and the model's ability to learn local features is reduced; therefore, to improve the robustness of the features that the model learns from the samples, the model is optimized with a Cascade Feature Discrimination (CFD) loss function; the local features output for an unlabeled image are concatenated to obtain the totality of the local feature information of the image, and the discriminability of the cascaded local features is learned by maximizing the inter-class distance and minimizing the intra-class distance; learning sample features with the hardest positive sample and the hardest negative samples strengthens the feature learning ability of the model and improves its robustness and accuracy, so a corresponding quadruplet loss function is proposed to guide the learning of the cascaded features; because the hardest positive sample and the hardest negative samples are used in the cascade feature discrimination loss function, they must be found in different ways;

first, a mini-batch of samples is given; for each input image X_i, a series of simple random transformations, including image cropping and changes of contrast, saturation and brightness, is applied to obtain a pseudo-positive sample X_pi, which is assigned the same identity as the input image and serves as the hardest positive sample; all image samples are fed into the network, and the randomly generated pseudo-positive samples benefit the discriminative feature learning of the unsupervised model; then, samples that are not nearest neighbors of each other are regarded as having dissimilar identities and as not belonging to the same class; whether two samples are nearest neighbors can be determined from the similarity between them, and the hardest negative sample pair is determined from the cyclic ranking of the similarity results; specifically, the Euclidean distance between every pair of samples is measured, and each sample X_i generates a list of the other samples sorted by the measurement results; the farther a sample X_j is from X_i, the lower its similarity to X_i, and if X_j is not among the top-n nearest neighbors of X_i, X_j can be identified as a negative sample of X_i; to mine the hardest negative pair, the first two negative samples in the ranking list, x_mi and x_ni, are selected as the hardest negative sample pair, where x_mi ranks before x_ni; finally, the model performs feature learning on the four samples thus obtained, and the Cascade Feature Discrimination (CFD) loss function is expressed as:

wherein x_ai represents the input image, x_pi represents the pseudo-positive sample, x_mi and x_ni represent the hardest negative sample pair, (·)₊ denotes taking the maximum of the enclosed value and zero, and the parameters α and β are thresholds.
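As a non-limiting sketch, the mining of the hardest negative pair and a quadruplet-style loss over the four samples may be written as follows. The CFD formula itself is not reproduced in the present text, so the loss below uses a commonly used quadruplet form with two hinge terms and the thresholds α and β; this form is an assumption of the sketch and not necessarily the claimed expression. The function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def hardest_negative_pair(anchor_idx, features, top_n):
    """Return indices (m, n) of the first two negatives in the ranking list.

    Samples are ranked by distance to the anchor; those outside the top-n
    nearest neighbors are treated as negatives, and m ranks before n.
    """
    others = [j for j in range(len(features)) if j != anchor_idx]
    ranked = sorted(
        others,
        key=lambda j: squared_distance(features[anchor_idx], features[j]))
    negatives = ranked[top_n:]
    if len(negatives) < 2:
        raise ValueError("need at least two samples outside the top-n list")
    return negatives[0], negatives[1]


def positive_part(value):
    """The (.)+ operator: maximum of the value and zero."""
    return max(value, 0.0)


def quadruplet_loss(xa, xp, xm, xn, alpha, beta):
    """Assumed quadruplet form: a relative term plus an absolute term."""
    d_ap = squared_distance(xa, xp)
    relative = positive_part(d_ap - squared_distance(xa, xm) + alpha)
    absolute = positive_part(d_ap - squared_distance(xm, xn) + beta)
    return relative + absolute
```

In this sketch the first term compares the anchor-to-positive distance with the anchor-to-negative distance (a relative distance), while the second compares it with the distance between the two negatives (an absolute distance), so that similar features are drawn together and dissimilar features are pushed apart.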

4. An unsupervised target pedestrian re-identification method based on joint local feature discriminant learning is characterized by comprising the following steps of: feature joint discrimination learning;

where λ represents a weight.