CN116778530A - Cross-appearance pedestrian re-identification detection method based on a generation model
- Publication number: CN116778530A
- Application number: CN202310898793.2A
- Authority: CN (China)
- Prior art keywords: pedestrian, appearance, image, model, network
- Prior art date: 2023-07-21
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
- G06N3/04: Neural networks; architecture, e.g. interconnection topology
- G06V10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
- G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis
- G06V10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
- G06V10/82: Image or video recognition or understanding using neural networks
- G06V40/25: Recognition of walking or running movements, e.g. gait recognition
Abstract
The invention provides a cross-appearance pedestrian re-identification detection method based on a generation model, belonging to the technical field of pedestrian re-identification and comprising the following steps: in the generation model, body-shape features and appearance features are exchanged between different pedestrian images to generate new pedestrian images; the RGB image of each pedestrian image is passed through a pre-trained edge detection network and a pedestrian semantic segmentation network to obtain a pedestrian contour sketch and a pedestrian parsing map; the pedestrian contour sketch, the RGB image and the pedestrian parsing map of each image in the pedestrian dataset are input into the same backbone network for feature extraction, and the features are fused before inference training. The method uses the generation model to augment pedestrian images in the appearance dimension and introduces the generated images into the training stage of the model; the three modalities share one backbone network for feature extraction, and fusion-based inference guides the model to learn the key features shared by different appearances of the same pedestrian, giving the model more robust performance in cross-appearance scenarios.
Description
Technical Field
The invention belongs to the technical field of pedestrian re-identification, and particularly relates to a cross-appearance pedestrian re-identification detection method based on a generation model.
Background
At present, computer vision technology is applied in a very wide range of scenarios. Face recognition now surpasses human performance and is widely used in industry, medical care, education and other fields, and attention in academia and industry has gradually turned to a topic of both scientific significance and application value: pedestrian re-identification. In practical scenarios such as transportation and industrial manufacturing, faces are in most cases blurred or not captured at all, so the effectiveness of face recognition is very limited.
Person re-identification (Re-ID) aims, given a surveillance image of a pedestrian, to retrieve other images of that pedestrian across devices. Pedestrian re-identification can compensate for the failure of face recognition and the visual limitations of fixed cameras, and is applied in video surveillance, intelligent security, smart living and other fields. Because Re-ID must find targets in images and videos across devices, the devices differ in resolution and position and their coverage areas do not overlap, which causes a lack of continuous information; scenes also differ in illumination, background and occlusion, and targets change in pose and appearance, all of which pose great challenges to pedestrian re-identification.
Pedestrian re-identification methods based on deep feature learning fall into several categories. Global features: feature learning is performed mainly on the whole-body image, with common improvements including attention mechanisms and multi-scale fusion. Local features: feature learning uses local image regions, such as body parts of the pedestrian or simple vertical stripes, and the local features are finally aggregated into the final pedestrian feature for recognition. Auxiliary features: auxiliary information such as semantic information, viewpoint information, domain information, GAN-generated information and data augmentation is used to strengthen feature learning. Specific network design: exploiting the characteristics of the Re-ID task, fine-grained, multi-scale and other network structures are designed to better suit the Re-ID scenario.
Pedestrian re-identification methods based on deep metric learning mainly comprise the design of different types of loss functions and improved sampling strategies. Identity loss: the training of Re-ID is treated as an image classification problem, with different pictures of the same pedestrian forming one category; the Softmax cross-entropy loss is common. Verification loss: the training of Re-ID is treated as an image matching problem, learning a binary classification of whether two images belong to the same pedestrian; contrastive and binary classification losses are common. Triplet loss: the training of Re-ID is treated as an image retrieval problem, requiring the feature distance between images of the same pedestrian to be smaller than that between different pedestrians, with various improved variants. Training-strategy improvements: adaptive sampling and different weight-allocation strategies.
Cross-appearance pedestrian re-identification methods: with the release of cross-appearance pedestrian re-identification datasets, researchers have proposed methods targeted at these datasets, most of which discard clothing-related appearance features. RF-ReID is an end-to-end person re-identification model that takes radio-frequency trajectories as input and extracts features from them, so that the model obtains accurate human contours for identification. On PRCC, researchers used Angle Specific Extractors (ASE) to extract fine-grained, angle-specific discriminative features by varying the sampling range of the SPT, and contributed a multi-stream network to aggregate multi-granularity features. BC-Net uses a clothing template to retrieve candidate pedestrian images with a dual-branch network, effectively fusing biometric and clothing features. CASE-Net uses a cross-appearance adversarial learning strategy to extract body-shape features and reconstructs body shape through pose-varying image generation. FSAM proposes a dual-stream framework that learns fine-grained body-shape information in a shape stream and transfers it into an appearance stream to complement the clothing-independent feature information within appearance features.
Early research on pedestrian re-identification focused mainly on hand-crafted features and learning better similarity metrics; with the development of deep learning, better features and similarity metrics are extracted and learned automatically by neural networks, which has greatly improved re-identification performance. However, these studies simplify the application scenario by assuming that retrieval happens within a short time and that the pedestrian's clothing does not change, so most re-identification models focus on the color and texture of clothing, which limits their applicability. Owing to manpower and time constraints, existing cross-appearance pedestrian re-identification datasets suffer from the following drawbacks: 1. they are small in scale, insufficient to support the training of deep learning models; 2. biases such as background differ greatly from the real world, which causes domain-adaptation problems for the model; 3. the source videos are captured over short time intervals with no obvious appearance change, making appearance differences hard to observe. In real scenes, people change clothes according to factors such as time and weather, so the assumption of short-term appearance consistency no longer applies. This new scenario poses a new challenge to pedestrian re-identification models: they must not depend on clothing-related appearance features and should instead focus on robust cues such as a person's hairstyle and body shape.
Disclosure of Invention
Aiming at the technical problems in the prior art of poor long-term cross-appearance pedestrian re-identification performance, large differences caused by clothing changes and a lack of relevant datasets, the invention provides a cross-appearance pedestrian re-identification detection method based on a generation model. The method uses the generation model to augment pedestrian images in the appearance dimension and introduces the generated images into the training stage of the model. It also introduces multi-modal pedestrian images: besides the traditional RGB image, two modalities with low appearance correlation are added, namely a pedestrian contour sketch and a pedestrian parsing map. The three modalities share one backbone network for feature extraction, and the extracted features are fused for inference.
The technical scheme adopted by the invention is as follows: a cross-appearance pedestrian re-identification detection method based on a generation model comprises the following steps:
Step 1: input pedestrian images from a pedestrian dataset into the generation model; different encoders of the generation model obtain the body-shape features and the appearance features of the images respectively; the body-shape and appearance features are exchanged between different pedestrian images, new pedestrian images are generated by a generator, a discriminator is used to determine the label information of the new images, and the new pedestrian images and their label information are stored in the pedestrian dataset;
Step 2: taking the RGB images of all pedestrian images in the pedestrian dataset as input, extract the contour information and the semantic segmentation information of the pedestrians through a pre-trained edge detection network and a pre-trained pedestrian semantic segmentation network respectively, obtaining a pedestrian contour sketch and a pedestrian parsing map, which are stored in the pedestrian dataset;
Step 3: input the pedestrian contour sketch, the RGB image and the pedestrian parsing map of each pedestrian image into the same backbone network to extract the features f_Contour, f_RGB and f_Parsing; fuse the features to obtain f_Comb, then perform inference training, and use the trained backbone network for cross-appearance pedestrian re-identification detection.
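For illustration only, the following condensed sketch shows how the three steps fit together in one pass. It is a minimal outline under stated assumptions: every name in it (edge_net, parse_net, backbone) is a hypothetical placeholder supplied by the user, not a component specified by the invention, and step 1 is assumed to have augmented the dataset offline.

```python
import torch

def forward_pass(rgb, edge_net, parse_net, backbone):
    """Illustrative sketch of steps 2-3 for one batch of RGB images;
    step 1 (appearance augmentation with the generation model) is offline."""
    with torch.no_grad():              # the prior networks are pre-trained and frozen
        contour = edge_net(rgb)        # step 2: pedestrian contour sketch
        parsing = parse_net(rgb)       # step 2: pedestrian parsing map
    f_contour = backbone(contour)      # step 3: one shared backbone
    f_rgb = backbone(rgb)              # applied to all three modalities
    f_parsing = backbone(parsing)
    # step 3: fuse by (weighted) concatenation; weights omitted in this sketch
    f_comb = torch.cat([f_contour, f_rgb, f_parsing], dim=1)
    return f_contour, f_rgb, f_parsing, f_comb
```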
Further, in step 1, after a pedestrian image is input into the generation model, its appearance features and body-shape features are obtained by an appearance encoder and a body-shape encoder respectively; the appearance and body-shape features of the same pedestrian image are input into the generator to obtain a self-reconstructed pedestrian image, and the discriminator determines the label information of the self-reconstructed image. The appearance and body-shape features of each self-reconstructed image are then extracted again by the appearance encoder and the body-shape encoder; the appearance features of one self-reconstructed image and the body-shape features of another are input into the generator to obtain a new pedestrian image, the discriminator determines the label information of the new image, and the new image and its label information are stored in the pedestrian dataset.
Further, in step 1, the appearance encoder is the feature extraction network of a pedestrian re-identification network, and the body-shape encoder is a network with a pyramid structure that attends to different scales.
Further, when the generation model is trained, the appearance encoder is constrained with an identity loss function, and the body-shape encoder guides the training of the model with an identity-label loss function.
Further, in step 2, the edge detection network is pre-trained on an edge detection dataset, and the pedestrian semantic segmentation network is pre-trained on the LIP dataset.
Further, in step 3, f_Contour, f_RGB and f_Parsing are fused by weighted concatenation.
Further, in step 3, the backbone network adopts DenseNet-121.
Further, in step 3, the loss of the backbone network comprises the TriHard losses of f_Contour, f_RGB and f_Parsing, and the identity loss and appearance loss of f_Comb, with the following formulas:

$$L_{tri}=\sum_{p=1}^{P}\sum_{k=1}^{K}\left[\alpha+\max_{k'=1\ldots K}D\!\left(f_{p}^{k},f_{p}^{k'}\right)-\min_{\substack{p'=1\ldots P,\;p'\ne p\\k'=1\ldots K}}D\!\left(f_{p}^{k},f_{p'}^{k'}\right)\right]_{+}\tag{1}$$

$$L_{id}=-\sum_{i}\log p\!\left(y_{i}\mid x_{i}\right)\tag{2}$$

$$L_{app}=-\sum_{i}\log p\!\left(z_{i}\mid x_{i}\right)\tag{3}$$

In formula (1), P and K denote the number of pedestrians and the number of images per pedestrian in one batch. Given a margin parameter α and a distance metric D, for the feature $f_{p}^{k}$ of the k-th image of the p-th person in the batch, the loss constrains the distance between this feature and the feature $f_{p}^{k'}$ of the hardest positive sample (the k'-th image of the same person p) and the feature $f_{p'}^{k'}$ of the hardest negative sample (the k'-th image of a different person p'). In formulas (2) and (3), given an image $x_{i}$ with identity label $y_{i}$ and appearance label $z_{i}$, $p(y_{i}\mid x_{i})$ and $p(z_{i}\mid x_{i})$ denote the probabilities with which the model recognizes the image as identity label $y_{i}$ and appearance label $z_{i}$ respectively.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention introduces a generation model. On one hand, the generation model provides data augmentation: with a limited pedestrian dataset, it improves dataset quality by increasing the diversity of pedestrians in the appearance dimension. On the other hand, the appearance encoder and the body-shape encoder of the generation model separate the two kinds of features, those related to appearance and those robust to appearance change; fusing the features extracted by the body-shape encoder with the other features improves the model's adaptability to appearance changes, yielding a pedestrian re-identification model with better performance. The images produced by the generation model can fill gaps in the distribution curve of the original dataset; their illumination and contrast remain similar to the original dataset and thus close to the actual test set, so compared with merging multiple datasets, the generated images are closer to the test set and are more likely to improve the model's results.
2. Because the traditional pedestrian re-recognition model is excessively dependent on appearance characteristics, the invention adopts three pedestrian images with different modes, not only maintains the adaptability of the model to the traditional pedestrian re-recognition task, but also reduces the dependence of the model recognition on the appearance, and the characteristics for reasoning are more suitable for scenes crossing the appearance. According to the invention, the pedestrian re-recognition model is fused with multi-mode information, and the target retrieval across the appearance is improved by considering two appearance independent clues, namely the contour feature and the human body analysis feature, so that the model is guided to learn key features of the same pedestrian with different appearances, and has more robust performance under a scene across the appearances.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of the generation model according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the image extraction and pedestrian re-identification model according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the drawings and the specific embodiments, so that those skilled in the art can better understand the technical solutions of the present invention.
The embodiment of the invention provides a cross-appearance pedestrian re-identification detection method based on a generation model, which is shown in fig. 1 and comprises the following steps:
Step 1: the number of images in a pedestrian dataset has a great influence on the performance of a deep learning model. Judging from the organization and scale of common pedestrian datasets, the number of appearances per pedestrian is small compared with the number of poses or viewpoints, and the one or two appearances currently available cannot support model optimization. The generation model is therefore used to combine pedestrians with various appearances to generate images of each pedestrian in other clothing, giving a single pedestrian a large number of appearances and guiding the re-identification model to learn robust pedestrian features.
The features of a cross-appearance pedestrian image generally fall into two classes. One class comprises attributes of the pedestrian themselves, such as height and body shape; these are unrelated to appearance factors such as clothing and are the feature information the model is expected to learn. The other class is appearance information such as the color and texture of clothing; this class occupies a large proportion in conventional pedestrian re-identification. In this embodiment, pedestrian images from the dataset are input into the generation model; different encoders obtain the body-shape features and appearance features of each image, and these features are exchanged between images of different pedestrians and between different appearances of the same pedestrian; a generator then produces new pedestrian images and a discriminator determines their label information. This process achieves the separation of the two classes of features, and finally the new images and their label information are stored in the dataset. Generating new pedestrian images enriches the diversity of pedestrian images, and adding the generated images to the training stage improves the quality of the dataset and thereby the model's ability to handle the cross-appearance problem. As shown in fig. 2, the specific procedure for data augmentation with the generation model is as follows: after the pedestrian dataset is input into the generation model, appearance features and body-shape features are obtained by the appearance encoder and the body-shape encoder respectively; the appearance and body-shape features of the same pedestrian image are input into the generator to obtain a self-reconstructed pedestrian image, and the discriminator determines its label information; the appearance and body-shape features of each self-reconstructed image are extracted again by the two encoders; the appearance features of one self-reconstructed image and the body-shape features of another are input into the generator to obtain a new pedestrian image; the discriminator determines the label information of the new image, and the new image and its label information are stored in the pedestrian dataset.
The body-shape encoder and the appearance encoder of the generation model are constrained with different loss functions so that the two encoders attend to different parts of the pedestrian image. Based on the prior knowledge that traditional pedestrian re-identification algorithms focus on appearance information such as clothing, the appearance encoder uses the feature extraction network of a common pedestrian re-identification network and is therefore constrained with the common identity loss function. The body-shape encoder uses a pyramid-structured network that attends to different scales. Since conventional pedestrian datasets contain no appearance changes in the pedestrian images, the identity label is consistent with the appearance label; this embodiment therefore introduces the pedestrian's identity label into the training of the generation model and guides the training with an identity-label loss function. The discriminator of the generation model judges whether the generated images are real or fake and, in its adversarial game with the generator, continuously improves the quality of the generated images.
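A minimal sketch of this swap-and-generate flow is given below. It assumes three user-supplied modules (the appearance encoder, the body-shape encoder and the generator); their internal architectures are assumptions for illustration, not networks published by the patent.

```python
import torch
import torch.nn as nn

class SwapGenerator(nn.Module):
    """Illustrative wrapper around the two encoders and the generator;
    the concrete module architectures are assumed, not taken from the patent."""
    def __init__(self, app_enc: nn.Module, body_enc: nn.Module, gen: nn.Module):
        super().__init__()
        self.app_enc = app_enc    # appearance encoder (re-ID-style feature net)
        self.body_enc = body_enc  # body-shape encoder (multi-scale pyramid net)
        self.gen = gen            # decoder fusing an appearance code and a body code

    def forward(self, img_a: torch.Tensor, img_b: torch.Tensor):
        # Self-reconstruction: appearance and body-shape codes from the same image.
        recon_a = self.gen(self.app_enc(img_a), self.body_enc(img_a))
        recon_b = self.gen(self.app_enc(img_b), self.body_enc(img_b))
        # Cross swap on the self-reconstructed images: appearance of one,
        # body shape of the other, yielding new (identity, clothing) pairs.
        new_ab = self.gen(self.app_enc(recon_a), self.body_enc(recon_b))
        new_ba = self.gen(self.app_enc(recon_b), self.body_enc(recon_a))
        return recon_a, recon_b, new_ab, new_ba
```

The discriminator (not shown) would score the reconstructed and swapped images against real ones, with the identity loss and the identity-label loss constraining the two encoders as described above.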
The quality of training data does not depend on quantity alone; the representativeness of the data matters more, so that the model can learn more robust features. The pedestrian images produced by the generation model are derived from the training set yet differ from it, so they can fill gaps in the data distribution curve and enrich the appearance dimension of the pedestrian images. When selecting training images for the generation model, the long-term cross-appearance characteristic is taken into account: the inputs of the generation model are chosen according to the time span, and pedestrian images with large appearance differences are selected for training and generation, so that obvious appearance differences are achieved.
Step 2: the outline of the pedestrian is mainly the edge information of the pedestrian, and the important information of the body type of the pedestrian is included; the human body analysis chart is mainly used for obtaining the information of the local area of the pedestrian and is often used as auxiliary information for pedestrian alignment. The outline information and the human body analytic graph of pedestrians are robust to the appearance change to a certain extent, and the pedestrian information robust to the appearance change can be provided for the network.
Extracting the pedestrian contour and the human parsing map constitutes the prior-knowledge extraction stage of the model. Before the model is trained, the RGB images of all pedestrian images in the dataset are taken as input; the contour information and semantic segmentation information of the pedestrians are extracted by an edge detection network pre-trained on an edge detection dataset and a pedestrian semantic segmentation network pre-trained on the LIP dataset respectively, yielding a pedestrian contour sketch and a pedestrian parsing map, and all images are stored in the pedestrian dataset under the original path, as shown in fig. 3.
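A sketch of this prior-extraction stage follows, assuming edge_net is an edge detector pre-trained on an edge-detection dataset and parse_net a human parsing network pre-trained on LIP; both names are placeholders, and tensors are saved here for brevity where the actual pipeline would write image files mirroring the dataset's original paths.

```python
from pathlib import Path
import torch
from PIL import Image
import torchvision.transforms.functional as TF

@torch.no_grad()
def extract_priors(rgb_path: str, edge_net, parse_net, out_dir: str) -> None:
    """Run the two frozen prior networks on one RGB image and store the
    resulting contour sketch and parsing map alongside the dataset."""
    img = TF.to_tensor(Image.open(rgb_path).convert("RGB")).unsqueeze(0)
    contour = edge_net(img)              # edge probabilities -> contour sketch
    parsing = parse_net(img).argmax(1)   # per-pixel body-part labels
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(rgb_path).stem
    torch.save(contour.cpu(), out / f"{stem}_contour.pt")
    torch.save(parsing.cpu(), out / f"{stem}_parsing.pt")
```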
Step 3: as shown in fig. 3, the pedestrian profile sketch, RGB image and pedestrian resolution of the pedestrian image of the pedestrian dataset are input with the same backbone extraction features to obtain f Contour 、f RGB And f Parsing Feature fusion to obtain f Conb And then carrying out reasoning training, and carrying out cross-appearance pedestrian re-identification detection on the trained backbone network.
Pedestrian contour information and the human parsing map cannot serve on their own as the features for pedestrian re-identification. Therefore, besides introducing the two new modalities into the re-identification network, the pedestrian RGB image is kept as input; the backbone network produces the global features of the RGB image and the features of the other two modalities, and these features supervise one another, with the RGB features supplying additional detail. Finally the three features are fused by weighted concatenation, which effectively reduces the dependence on appearance-related pedestrian features during recognition and improves the accuracy of the re-identification model.
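The following sketch illustrates the shared-backbone extraction and the weighted concatenation, using torchvision's DenseNet-121 trunk (ImageNet weights only as a generic initialization). The fusion weights (0.3, 1.0, 0.3) are illustrative assumptions; the patent specifies weighted splicing but not the weight values.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# One DenseNet-121 trunk shared by all three modalities.
backbone = models.densenet121(weights="IMAGENET1K_V1").features

def embed(x: torch.Tensor) -> torch.Tensor:
    # Global-average-pool the last DenseNet feature map into a 1024-d vector.
    return torch.flatten(F.adaptive_avg_pool2d(backbone(x), 1), 1)

def fuse(contour: torch.Tensor, rgb: torch.Tensor, parsing: torch.Tensor,
         w=(0.3, 1.0, 0.3)):
    # All inputs are Bx3xHxW; the 1-channel contour sketch and the parsing
    # map are assumed to have been expanded to 3 channels beforehand.
    f_contour, f_rgb, f_parsing = embed(contour), embed(rgb), embed(parsing)
    # "Weighted splicing": scale each modality's feature, then concatenate.
    f_comb = torch.cat([w[0] * f_contour, w[1] * f_rgb, w[2] * f_parsing], dim=1)
    return f_contour, f_rgb, f_parsing, f_comb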
Pedestrian re-identification generally comprises two stages, feature extraction and distance metric, and the quality of the feature extraction network directly affects the quality of the re-identification model. Existing work commonly uses a deep network as the backbone; because of differences in network structure and depth, the effect of feature extraction varies considerably. In this embodiment, DenseNet-121 is selected as the backbone network, and experiments were conducted with networks of different depths, gradually increasing the network scale while preserving recognition speed, to obtain better re-identification performance. Empirically, a deep neural network tends to achieve better results as its layers deepen, but vanishing gradients can then markedly reduce the training effect. DenseNet-121 connects the input of each layer with the outputs of all preceding layers, which alleviates the vanishing-gradient problem, connects shallow and deep pedestrian features, and combines global features with fine-grained local features, so it is well suited to the multi-scale features needed for pedestrian re-identification, as shown in table 1.
Table 1. Network structure and output sizes of DenseNet-121
Existing pedestrian re-identification models often adopt the TriHard loss function: within one batch, the positive sample with the lowest similarity to the anchor image and the negative sample with the highest similarity are selected to compute a triplet loss. To guide the DenseNet-121 network to learn jointly from the data and features of the three modalities, this embodiment introduces identity loss, appearance loss and TriHard loss, namely the TriHard losses of f_Contour, f_RGB and f_Parsing, together with the identity loss and appearance loss of f_Comb. These five losses jointly tune the parameters of the backbone network; when tuning, the appearance loss is multiplied by a coefficient λ, 0 < λ < 1. The appearance loss guides the model to learn appearance-related features, the identity loss guides it to learn features related to pedestrian identity, and the TriHard loss makes the model more robust in classifying hard samples. The TriHard loss computes the similarity of feature vectors in Euclidean space, while the identity loss and appearance loss compute vector similarity in cosine space. The formulas of the three loss functions are as follows:
in the above formula (1), P and K respectively represent the number of pedestrians and the number of corresponding images in one batch, and an alpha interval parameter alpha and a distance measure D are given for the kth image of the kth person in the batch Obtained byK' th image for p-th person in lot +.>Obtained->K 'th image for p' th person in batch +.>Obtained->Restricting its characteristics to the most difficult positive samples +.>And the characteristics of the most difficult negative samples->Is a distance of (2); in the above formulas (2) and (3), the image x is given i Identity tag y i And appearance label z i ,p(y i |x i ) And p (z) i |x i ) Respectively representing the model to recognize the image as identity tag y i And appearance label z i Is a probability distribution of (c).
To measure the performance of the proposed generation-model-based cross-appearance pedestrian re-identification model, this embodiment conducts comparative experiments on the NKUP dataset between the proposed method (Ours) and other SOTA pedestrian re-identification methods; the experimental results are shown in table 2. Rank-1 and mAP are used to evaluate the model's performance: the former is the precision of the first-ranked pedestrian image in the query list, and the latter is the mean average precision over all query samples. The baseline model in table 2 is the network model with generation-model-based data augmentation. The experimental results show that the method provided by this embodiment has clear advantages, achieving the best results in particular in the cross-appearance re-identification setting, which verifies the model's improvement in cross-appearance pedestrian re-identification.
Table 2. Comparison of experimental results
The present invention has been described in detail by way of examples, but the description is merely exemplary and should not be construed as limiting the scope of the invention, which is defined by the claims. Similar technical solutions designed within the technical scheme of the invention, or under its inspiration, to achieve the same technical effects, as well as equivalent changes and improvements within its scope of application, still fall within the protection scope of this patent.
Claims (8)
1. A cross-appearance pedestrian re-identification detection method based on a generation model, characterized by comprising the following steps:
Step 1: input pedestrian images from a pedestrian dataset into the generation model; different encoders of the generation model obtain the body-shape features and the appearance features of the images respectively; the body-shape and appearance features are exchanged between different pedestrian images, new pedestrian images are generated by a generator, a discriminator is used to determine the label information of the new images, and the new pedestrian images and their label information are stored in the pedestrian dataset;
Step 2: taking the RGB images of all pedestrian images in the pedestrian dataset as input, extract the contour information and the semantic segmentation information of the pedestrians through a pre-trained edge detection network and a pre-trained pedestrian semantic segmentation network respectively, obtaining a pedestrian contour sketch and a pedestrian parsing map, which are stored in the pedestrian dataset;
Step 3: input the pedestrian contour sketch, the RGB image and the pedestrian parsing map of each pedestrian image into the same backbone network to extract the features f_Contour, f_RGB and f_Parsing; fuse the features to obtain f_Comb, then perform inference training, and use the trained backbone network for cross-appearance pedestrian re-identification detection.
2. The cross-appearance pedestrian re-identification detection method based on a generation model according to claim 1, characterized in that in step 1, after a pedestrian image is input into the generation model, its appearance features and body-shape features are obtained by an appearance encoder and a body-shape encoder respectively; the appearance and body-shape features of the same pedestrian image are input into the generator to obtain a self-reconstructed pedestrian image, and the discriminator determines the label information of the self-reconstructed image; the appearance and body-shape features of each self-reconstructed image are then extracted again by the appearance encoder and the body-shape encoder; the appearance features of one self-reconstructed image and the body-shape features of another are input into the generator to obtain a new pedestrian image, the discriminator determines the label information of the new image, and the new image and its label information are stored in the pedestrian dataset.
3. The cross-appearance pedestrian re-identification detection method based on a generation model according to claim 2, characterized in that in step 1, the appearance encoder is the feature extraction network of a pedestrian re-identification network, and the body-shape encoder is a network with a pyramid structure that attends to different scales.
4. The cross-appearance pedestrian re-identification detection method based on a generation model according to claim 3, characterized in that when the generation model is trained, the appearance encoder is constrained using an identity loss function, and the body-shape encoder guides the training of the model using an identity-label loss function.
5. The cross-appearance pedestrian re-identification detection method based on a generation model according to claim 1, characterized in that in step 2, the edge detection network is pre-trained on an edge detection dataset, and the pedestrian semantic segmentation network is pre-trained on the LIP dataset.
6. The cross-appearance pedestrian re-identification detection method based on a generation model according to claim 1, characterized in that in step 3, f_Contour, f_RGB and f_Parsing are fused by weighted concatenation.
7. The cross-appearance pedestrian re-identification detection method based on a generation model according to claim 1, characterized in that in step 3, the backbone network adopts DenseNet-121.
8. The cross-appearance pedestrian re-identification detection method based on a generation model according to claim 1 or 7, characterized in that in step 3, the loss of the backbone network comprises the TriHard losses of f_Contour, f_RGB and f_Parsing, and the identity loss and appearance loss of f_Comb, with the following formulas:

$$L_{tri}=\sum_{p=1}^{P}\sum_{k=1}^{K}\left[\alpha+\max_{k'=1\ldots K}D\!\left(f_{p}^{k},f_{p}^{k'}\right)-\min_{\substack{p'=1\ldots P,\;p'\ne p\\k'=1\ldots K}}D\!\left(f_{p}^{k},f_{p'}^{k'}\right)\right]_{+}\tag{1}$$

$$L_{id}=-\sum_{i}\log p\!\left(y_{i}\mid x_{i}\right)\tag{2}$$

$$L_{app}=-\sum_{i}\log p\!\left(z_{i}\mid x_{i}\right)\tag{3}$$

In formula (1), P and K denote the number of pedestrians and the number of images per pedestrian in one batch. Given a margin parameter α and a distance metric D, for the feature $f_{p}^{k}$ (1 ≤ k ≤ K) of the k-th image of the p-th person in the batch, the loss constrains the distance between this feature and the feature $f_{p}^{k'}$ (1 ≤ k' ≤ K) of the hardest positive sample and the feature $f_{p'}^{k'}$ (1 ≤ k' ≤ K) of the hardest negative sample. In formulas (2) and (3), given an image $x_{i}$ with identity label $y_{i}$ and appearance label $z_{i}$, $p(y_{i}\mid x_{i})$ and $p(z_{i}\mid x_{i})$ denote the probabilities with which the model recognizes the image as identity label $y_{i}$ and appearance label $z_{i}$ respectively.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310898793.2A | 2023-07-21 | 2023-07-21 | Cross-appearance pedestrian re-identification detection method based on generation model |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN116778530A | 2023-09-19 |

Family ID: 88013403

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310898793.2A | Cross-appearance pedestrian re-identification detection method based on generation model | 2023-07-21 | 2023-07-21 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN116778530A (en) |
Cited By (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117953539A | 2024-01-24 | 2024-04-30 | Nanjing University of Information Science and Technology | Sketch pedestrian re-identification method adjusted by cross-modal prompts |
| CN117831081A | 2024-03-06 | 2024-04-05 | Qilu University of Technology (Shandong Academy of Sciences) | Method and system for re-identifying clothing-changing pedestrians based on clothing-changing data and a residual network |
| CN117831081B | 2024-03-06 | 2024-05-24 | Qilu University of Technology (Shandong Academy of Sciences) | Method and system for re-identifying clothing-changing pedestrians based on clothing-changing data and a residual network |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |