CN111967408B - Low-resolution pedestrian re-identification method and system based on prediction-recovery-identification - Google Patents


Info

Publication number
CN111967408B
Authority
CN
China
Prior art keywords
resolution
picture
low
super
scale
Prior art date
Legal status
Active
Application number
CN202010843411.2A
Other languages
Chinese (zh)
Other versions
CN111967408A (en)
Inventor
王亮 (Wang Liang)
黄岩 (Huang Yan)
韩苛 (Han Ke)
单彩峰 (Shan Caifeng)
纪文峰 (Ji Wenfeng)
Current Assignee
Cas Artificial Intelligence Research Qingdao Co ltd
Original Assignee
Cas Artificial Intelligence Research Qingdao Co ltd
Priority date
Filing date
Publication date
Application filed by Cas Artificial Intelligence Research Qingdao Co ltd filed Critical Cas Artificial Intelligence Research Qingdao Co ltd
Priority to CN202010843411.2A
Publication of CN111967408A
Application granted
Publication of CN111967408B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00: Geometric image transformation in the plane of the image
    • G06T3/40: Scaling the whole image or part thereof
    • G06T3/4053: Super resolution, i.e. output image resolution higher than sensor resolution
    • G06T3/4076: Super resolution, i.e. output image resolution higher than sensor resolution by iteratively correcting the provisional high resolution image using the original low-resolution image

Abstract

The invention discloses a low-resolution pedestrian re-identification method and system based on "prediction-recovery-identification", comprising the following steps: inputting the acquired low-resolution picture to be identified into a trained deep neural network model, and performing detail recovery at the optimal scale to obtain a super-resolution picture; and calculating the Euclidean distance between the super-resolution picture features and the high-resolution search library picture features, and ranking identity matches according to the Euclidean distance. The predictor of the deep neural network model can adaptively predict a better scale factor according to the content of the low-resolution picture, so as to achieve better recovery and recognition effects.

Description

Low-resolution pedestrian re-identification method and system based on prediction-recovery-identification
Technical Field
The invention relates to the technical field of pattern recognition and machine learning, in particular to a low-resolution pedestrian re-recognition method and system based on 'prediction-recovery-recognition'.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
The goal of pedestrian re-identification is, given a pedestrian query picture (query) under one camera, to match pictures of the same pedestrian in the picture library (gallery) of another camera. With cameras deployed ever more widely in streets and alleys, pedestrian re-identification shows broad application prospects in security and related fields. For example, scenarios such as finding lost children and tracking suspects can be assisted by pedestrian re-identification technology.
Low-resolution pedestrian re-identification refers to the case where the resolution of the pedestrian query picture (query) is low while the resolution of the pictures in the picture library (gallery) is high. Many previous methods use super-resolution modules for picture enlargement and detail restoration of low-resolution pictures. However, they often preset a fixed scale factor when super-resolving the picture, which easily causes problems such as insufficient detail recovery or excessive noise, making it difficult to ensure that the restored picture is best suited for identifying the identity.
Disclosure of Invention
In order to solve the above problems, the invention provides a low-resolution pedestrian re-identification method and system based on 'prediction-recovery-identification', which can adaptively predict a better scale factor according to the content of a low-resolution picture so as to achieve better recovery and identification effects.
In some embodiments, the following technical scheme is adopted:
the low-resolution pedestrian re-identification method based on prediction-recovery-identification comprises the following steps:
inputting the acquired low-resolution picture to be identified into a trained deep neural network model, and performing detail recovery under the optimal scale to obtain a super-resolution picture;
and calculating the Euclidean distance between the super-resolution picture features and the high-resolution search library picture features, and ranking identity matches according to the Euclidean distance.
In other embodiments, the following technical solutions are adopted:
a "prediction-restoration-recognition" based low resolution pedestrian re-recognition system comprising:
the device is used for inputting the acquired low-resolution picture to be recognized into the trained deep neural network model, and performing detail recovery under the optimal scale to obtain a super-resolution picture;
and the device is used for calculating the Euclidean distance between the super-resolution picture features and the high-resolution search library picture features and ranking identity matches according to the Euclidean distance.
In other embodiments, the following technical solutions are adopted:
a terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions; the computer readable storage medium is for storing a plurality of instructions adapted to be loaded by a processor and to perform the above-described "prediction-restoration-recognition" based low resolution pedestrian re-recognition method.
In other embodiments, the following technical solutions are adopted:
a computer readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and to execute the above-mentioned "prediction-restoration-recognition" based low resolution pedestrian re-recognition method.
Compared with the prior art, the invention has the beneficial effects that:
(1) the predictor of the deep neural network model can adaptively predict a better scale factor according to the content of the low-resolution picture so as to achieve better recovery and recognition effects.
(2) The scale factor measurement provided by the invention evaluates the quality of each scale factor in a self-supervision mode without depending on manual marking of the optimal scale factor, and the whole network model can be trained end to end, thereby improving the training efficiency.
Drawings
Fig. 1 is a flowchart of a low-resolution pedestrian re-identification method based on "prediction-recovery-identification" in the embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example one
In one or more embodiments, a low-resolution pedestrian re-identification method based on "prediction-recovery-identification" is disclosed, which can predict a better scale factor for a low-resolution pedestrian picture, perform detail recovery and picture amplification under the scale factor, and then perform identity identification on the recovered picture.
The method specifically comprises the following steps:
(1) inputting the acquired low-resolution picture to be identified into a trained deep neural network model, and performing detail recovery under the optimal scale to obtain a super-resolution picture;
(2) calculating the Euclidean distance between the super-resolution picture features and the high-resolution search library picture features, and ranking identity matches according to the Euclidean distance.
It should be noted that the resolution in this embodiment refers to the product of the picture width and the picture height; the resolution is not absolute, but rather a relative concept, the resolution of the picture in the high resolution search library (gallery) is typically 2 times to 4 times higher than the query picture (query) (i.e. the low resolution picture to be identified), or even more. When the super-resolution restoration is performed on the picture, it is often necessary to specify a scale factor, for example, the scale factor is 3, which means that the super-resolution picture is obtained after the width and the height of the picture are both amplified by 3 times.
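As a concrete illustration of the scale-factor convention described above (a scale factor of 3 amplifies both width and height by 3 times), here is a minimal sketch; the function name is illustrative, not from the patent:

```python
def super_resolution_size(width: int, height: int, scale_factor: int) -> tuple:
    """Return the (width, height) of the super-resolved picture:
    both dimensions are multiplied by the scale factor."""
    return (width * scale_factor, height * scale_factor)

# A 32x64 low-resolution query picture upscaled with scale factor 3:
print(super_resolution_size(32, 64, 3))  # → (96, 192)
```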
Specifically, in this embodiment, the process of establishing the deep neural network model and training the model specifically includes the following steps:
(1) Establish the deep neural network model and set the corresponding module network structures. As shown in FIG. 1, the model comprises three parts: an Adaptive Scale Factor Predictor P, a Super-Resolution Module G, and a Re-identification Module F (Re-id Module). In addition, N selectable scale factors {r_1, r_2, …, r_N} are preset.
(2) Group the data in the training data set; one group of training data consists of three pictures {x_l, x_sl, x_h}. x_l is a low-resolution picture, and x_h is a high-resolution picture with the same identity tag but a different camera tag. x_sl is a synthetic low-resolution picture obtained by directly down-sampling x_h.
(3) Input x_l into the predictor P. P can be viewed as a classifier that outputs the probabilities {p̂_1, p̂_2, …, p̂_N} that x_l belongs respectively to the N classes {r_1, r_2, …, r_N}; that is, p̂_i denotes the probability that r_i is the optimal scale of x_l. The scale factor corresponding to the maximum probability is the predicted optimal scale r_p, i.e.

r_p = argmax_{r_i, i = 1, …, N} p̂_i
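The classify-then-argmax step of the predictor can be sketched as follows. This is a minimal illustration assuming a softmax over predictor logits; the helper name and logit values are hypothetical, not from the patent:

```python
import math

def predict_optimal_scale(logits, scale_factors):
    """Softmax over the predictor's logits, then pick the scale factor
    with the maximum probability (the predicted optimal scale r_p)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return scale_factors[best], probs

scales = [1, 2, 3, 4]                          # the N = 4 preset scale factors
r_p, probs = predict_optimal_scale([0.1, 0.3, 2.0, 0.5], scales)
print(r_p)  # → 3
```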
(4) Input x_l and x_sl into the super-resolution module G for detail recovery and picture amplification. x_l is amplified in turn with each of {r_1, r_2, …, r_N} as the scale factor, obtaining the N corresponding super-resolution pictures {x_sr^(1), x_sr^(2), …, x_sr^(N)}. x_sl is amplified with one randomly selected scale factor r_sl ∈ {r_1, r_2, …, r_N}, obtaining the corresponding super-resolution picture x_sr^(sl).
The high-resolution picture x_h and the super-resolution picture x_sr^(sl) form a super-resolution loss L_sr used to train the super-resolution module. L_sr can be expressed as

L_sr = (1 / (W·H)) · Σ_{i=1}^{W} Σ_{j=1}^{H} | (x_h)_{i,j} − (x_sr^(sl))_{i,j} |

where X_{i,j} is the pixel value at coordinate (i, j) of picture X, and W and H are respectively the width and height of the picture.
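A minimal sketch of a per-pixel super-resolution loss of the kind described above, assuming a mean absolute (L1) pixel difference over all W·H pixels; the patent's exact formulation is only available as an image, so this form is an assumption:

```python
def super_resolution_loss(x_h, x_sr):
    """Mean absolute pixel difference between two equally-sized pictures,
    given as nested lists of pixel values (rows of columns)."""
    w, h = len(x_h), len(x_h[0])
    total = sum(abs(x_h[i][j] - x_sr[i][j]) for i in range(w) for j in range(h))
    return total / (w * h)

# Toy 2x2 "pictures": two pixels match exactly, two differ by 0.5 each.
x_h  = [[1.0, 2.0], [3.0, 4.0]]
x_sr = [[1.0, 2.5], [3.0, 3.5]]
print(super_resolution_loss(x_h, x_sr))  # → 0.25
```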
(5) Send all the super-resolution pictures {x_sr^(1), …, x_sr^(N), x_sr^(sl)} and the high-resolution picture x_h into the re-identification module F to extract features, denoted in turn as {f_sr^(1), …, f_sr^(N), f_sr^(sl), f_h}.
To compare which of the different-scale pictures recovered from x_l contains more discriminative content, we propose the concept of the scale factor measurement. In the feature space, the scale factor measurement regards f_h as an "anchor" and computes in turn the Euclidean distance between each f_sr^(i) and f_h. When evaluating the scale factors, a scale factor whose feature is closer to the anchor has a higher probability of being the optimal factor.

The specific formula is as follows:

p̃_i = d_i^(−γ) / Σ_{j=1}^{N} d_j^(−γ)

where d_i is the Euclidean distance between f_sr^(i) and f_h, and γ is a regulation factor. p̃_i represents the probability that r_i is the ground-truth optimal scale of the low-resolution picture x_l. γ is an odd number to ensure that scale factors whose features are closer to x_h in the feature space are given higher probability.
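Turning the per-scale distances d_i into a probability vector in which closer features receive higher probability can be sketched as below. The inverse-power weighting is an assumption for illustration, consistent with the surrounding description but not necessarily the patent's exact formula:

```python
def dynamic_soft_labels(distances, gamma=1):
    """Convert per-scale feature distances into a probability vector:
    a smaller distance to the anchor feature yields a higher probability."""
    weights = [d ** (-gamma) for d in distances]   # closer → larger weight
    total = sum(weights)
    return [w / total for w in weights]

# Distances of 4 candidate scale features to the high-resolution anchor:
labels = dynamic_soft_labels([0.5, 1.0, 2.0, 4.0], gamma=1)
# weights 2, 1, 0.5, 0.25 normalize to [0.533..., 0.266..., 0.133..., 0.066...]
print(max(range(4), key=lambda i: labels[i]))  # → 0 (the closest scale)
```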
We use {p̃_1, p̃_2, …, p̃_N} to form an N-dimensional real-valued vector p̃, called the dynamic soft label, which is used to supervise the prediction result {p̂_1, p̂_2, …, p̂_N} of the predictor P. The prediction loss L_p is defined as follows:

L_p = − Σ_{i=1}^{N} p̃_i · log p̂_i
When using the error back-propagation algorithm, we cut off the propagation of the error from L_p to the dynamic soft labels. That is, the evaluated dynamic soft label is regarded as a ground-truth value and is used only to supervise the predictor P; it does not affect the super-resolution module G or the re-identification module F.
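The prediction loss can be sketched as a cross-entropy between the predictor output and the dynamic soft labels, with the soft labels treated as constants (no gradient flows into them), mirroring the cut-off described above; in a framework such as PyTorch this would correspond to calling `.detach()` on the soft-label tensor. The probability values below are hypothetical:

```python
import math

def prediction_loss(p_hat, p_tilde, eps=1e-12):
    """Cross-entropy L_p = -sum_i p_tilde_i * log(p_hat_i).
    p_tilde (the dynamic soft label) is treated as a fixed target."""
    return -sum(t * math.log(p + eps) for p, t in zip(p_hat, p_tilde))

loss = prediction_loss([0.1, 0.2, 0.6, 0.1],    # predictor output p_hat
                       [0.05, 0.15, 0.7, 0.1])  # dynamic soft label p_tilde
print(round(loss, 3))  # → 0.944
```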
(6) Define the identity cross-entropy loss L_id and the identity triplet loss L_trip using the identity tag data in the dataset. The total loss L when training the model is as follows:

L = L_sr + α·L_id + β·L_trip + λ·L_p

where α, β, and λ are all weight factors. We use the error back-propagation algorithm to minimize this objective function and thus optimize the parameters of the model.
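The total-loss combination is a plain weighted sum, which can be sketched as follows; the weight values are illustrative defaults, not values from the patent:

```python
def total_loss(l_sr, l_id, l_trip, l_p, alpha=0.5, beta=0.5, lam=0.1):
    """Weighted sum L = L_sr + alpha*L_id + beta*L_trip + lambda*L_p.
    The weight values here are illustrative, not from the patent."""
    return l_sr + alpha * l_id + beta * l_trip + lam * l_p

# Example loss terms (hypothetical values):
print(round(total_loss(0.2, 1.5, 0.8, 0.9), 2))  # → 1.44
```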
(7) In order to give better initialization weights to the super-resolution module G and the re-identification module F, we perform pre-training before training the entire network. The predictor P and the scale factor measurement are removed during pre-training. x_l and x_sl each randomly select a scale factor, are amplified by the super-resolution module G, and are then input into the re-identification module F, while x_h is input directly into F. The prediction loss L_p is correspondingly removed from the total loss function, and the model parameters are optimized by minimizing only L_sr, L_id, and L_trip.
(8) When testing, the low-resolution query picture (query) is input into the predictor P, and the scale factor r_p with the highest predicted probability is taken as the optimal scale. The super-resolution module G then recovers details at this scale to obtain a super-resolution picture, which is finally input into the re-identification module F. The high-resolution search library pictures (gallery) are input directly into the re-identification module F. In the feature space, the similarity between the super-resolution picture and a high-resolution picture is measured by calculating the Euclidean distance between their features. The smaller the Euclidean distance between two picture features, the higher the similarity; and the higher the similarity, the earlier the search library picture (gallery) appears in the identity matching result for the query picture (query).
To illustrate a detailed embodiment of the present invention, the MLR-CUHK03 dataset is taken as an example. In this example, the predictor P and the re-identification module F are both ResNet50 networks pre-trained on the ImageNet dataset, and a 1×1 convolutional layer is added after the global average pooling layer of ResNet50 to reduce the dimension of the feature vector from 2048 to 512. The super-resolution module G is a Meta-SR network pre-trained on the DIV2K dataset. The 4 predictable scale factors are {1, 2, 3, 4}, representing amplification of the picture by 1, 2, 3, and 4 times respectively. The specific steps are as follows:
Step S1: each low-resolution picture x_l in the training dataset is randomly paired with a high-resolution picture x_h that has the same identity tag but a different camera tag to form a group of training data. A scale factor is randomly selected for x_h, and bilinear-interpolation down-sampling is performed to obtain the synthesized low-resolution picture x_sl.
Step S2: pre-train the super-resolution module G and the re-identification module F. x_l and x_sl each randomly select a scale factor, are amplified by the super-resolution module G, and are then input into the re-identification module F, while x_h is input directly into F. The weighted sum of the super-resolution loss L_sr, the identity cross-entropy loss L_id, and the identity triplet loss L_trip is the total loss function, and G and F are optimized by minimizing it. The pre-training period is set to 20 epochs. The optimized G and F are used as initialization modules when training the whole model.
Step S3: the weighted sum of the prediction loss L_p, the super-resolution loss L_sr, the identity cross-entropy loss L_id, and the identity triplet loss L_trip is the total loss function, and the entire model is trained. The training period is set to 60 epochs.
Step S4: during testing, a low-resolution query picture (query) is input into the predictor P, which outputs the probability that each scale factor is the optimal scale for the picture. The scale factor r_p with the highest predicted probability is taken as the optimal scale; details are recovered at this scale to obtain a super-resolution picture, which is input into the re-identification module F to extract features. High-resolution search library pictures (gallery) are input directly into the re-identification module F. In the feature space, similarity is measured by calculating the Euclidean distance between the super-resolution picture features and the high-resolution picture features; the smaller the Euclidean distance between two picture features, the higher the similarity. Identity matches are ranked by ordering the search library pictures (gallery) from highest to lowest similarity to the query picture (query).
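The test-time matching step above, which ranks gallery pictures by Euclidean distance to the query feature in feature space, can be sketched as follows (the 2-D feature vectors are illustrative; real features would be 512-dimensional):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices ordered from most to least similar to the
    query (smaller distance = higher similarity = earlier rank)."""
    dists = [euclidean(query_feat, g) for g in gallery_feats]
    return sorted(range(len(dists)), key=lambda i: dists[i])

query = [1.0, 0.0]
gallery = [[0.9, 0.1],   # close to the query → should rank first
           [0.0, 1.0],   # far from the query → should rank last
           [0.5, 0.5]]
print(rank_gallery(query, gallery))  # → [0, 2, 1]
```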
Example two
In one or more embodiments, a "prediction-restoration-recognition" based low resolution pedestrian re-recognition system is disclosed, comprising:
the device is used for inputting the acquired low-resolution picture to be recognized into the trained deep neural network model, and performing detail recovery under the optimal scale to obtain a super-resolution picture;
and the device is used for calculating the Euclidean distance between the super-resolution picture features and the high-resolution search library picture features and ranking identity matches according to the Euclidean distance.
It should be noted that the specific implementation manner of the apparatus is the same as that of the method disclosed in the first embodiment, and is not described again.
EXAMPLE III
In one or more implementations, a terminal device is disclosed, comprising a server including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the low-resolution pedestrian re-identification method based on "prediction-recovery-identification" in the first embodiment. For brevity, no further description is provided herein.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on.
A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The low-resolution pedestrian re-identification method based on "prediction-recovery-identification" in the first embodiment can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, it is not described in detail here.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims (9)

1. The low-resolution pedestrian re-identification method based on prediction-recovery-identification is characterized by comprising the following steps of:
inputting the acquired low-resolution picture to be identified into a trained deep neural network model, and performing detail recovery under the optimal scale to obtain a super-resolution picture;
calculating the Euclidean distance between the super-resolution picture features and the high-resolution search library picture features, and ranking identity matches according to the Euclidean distance; the deep neural network model comprises:
a predictor for predicting, for each preset scale of the low-resolution picture, the probability that the scale is the optimal scale; specifically, the predictor extracts features of the low-resolution picture and outputs the probabilities {p̂_1, p̂_2, …, p̂_N} that the low-resolution picture belongs respectively to the N classes {r_1, r_2, …, r_N}, where p̂_i represents the probability that r_i is the optimal scale factor of the low-resolution picture; the evaluation of the probability that each scale factor is the optimal scale is realized through the scale factor measurement, whose formula is as follows:

p̃_i = d_i^(−γ) / Σ_{j=1}^{N} d_j^(−γ)

wherein N selectable scale factors {r_1, r_2, …, r_N} are preset, p̃_i represents the probability that the scale factor r_i is the ground-truth optimal scale of the low-resolution picture x_l, d_i represents the Euclidean distance between each super-resolution picture feature corresponding to the low-resolution picture and the high-resolution search library picture feature, and γ is a regulation factor;
the super-resolution module is used for traversing the N scale factors of the low-resolution picture and generating the N corresponding super-resolution pictures respectively;
and the re-identification module is used for extracting picture features and calculating the Euclidean distance between each super-resolution picture feature corresponding to the low-resolution picture and the high-resolution search library picture feature, so as to measure the similarity between the two picture features.
2. The "prediction-restoration-recognition-based" low-resolution pedestrian re-recognition method of claim 1, wherein the training process for the deep neural network model comprises:
grouping pictures in the constructed training data set, wherein each group of training data comprises a low-resolution picture, a high-resolution picture with the same identity label but different camera labels, and a synthesized low-resolution picture obtained by down-sampling the high-resolution picture;
for each group of training data sets, predicting the probability that each preset scale of the low-resolution pictures is the optimal scale through a deep neural network model, traversing N scale factors of the low-resolution pictures, and generating N corresponding super-resolution pictures respectively; randomly selecting a scale factor for synthesizing the low-resolution picture to generate 1 super-resolution picture;
calculating the Euclidean distance between each super-resolution picture feature and the high-resolution picture feature, wherein the calculation result is used as a dynamic soft label and represents the evaluation probability that each scale factor is the optimal scale;
different loss functions are constructed, and network parameters are trained and optimized by minimizing a weighted sum of the different loss functions.
3. The "prediction-restoration-recognition-based" low-resolution pedestrian re-recognition method of claim 2, wherein the loss functions comprise: a super-resolution loss function, an identity cross-entropy loss function, an identity triplet loss function, and a prediction loss function; the dynamic soft label is used as a ground-truth supervision signal for the prediction probability to form the prediction loss function.
4. The method for re-identifying pedestrians at low resolution based on "prediction-recovery-recognition" as claimed in claim 2, wherein, when training with each batch of training data, one of the N scale factors is randomly selected for the input low-resolution picture and the synthesized low-resolution picture, and picture recovery is performed at the corresponding scale; the predictor and the prediction loss function are removed, the weighted sum of the super-resolution loss function, the identity cross-entropy loss function, and the identity triplet loss function is taken as the total loss, and the super-resolution module and the re-recognition module are pre-trained by minimizing the total loss.
5. The method for re-identifying pedestrians at low resolution based on "prediction-recovery-recognition" as claimed in claim 2, wherein the tested low-resolution picture is input into the predictor, the scale factor with the highest predicted probability is taken as the optimal scale, and the super-resolution picture recovered at that scale is then sent into the re-identification module; the tested high-resolution picture is directly input into the re-identification module;
and calculating Euclidean distance between the super-resolution picture features and the high-resolution picture features in the feature space to measure the similarity of the super-resolution picture features and the high-resolution picture features.
6. The "prediction-recovery-recognition-based" low-resolution pedestrian re-recognition method according to claim 5, wherein the higher the similarity, the higher the position of the high-resolution search library picture in the identity matching result.
7. A "prediction-restoration-recognition" based low resolution pedestrian re-recognition system, comprising:
the device is used for inputting the acquired low-resolution picture to be identified into the trained deep neural network model, and performing detail recovery under the optimal scale to obtain a super-resolution picture;
the device is used for calculating Euclidean distances between the super-resolution picture features and the high-resolution search library picture features and sequencing identity matching according to the Euclidean distances; the deep neural network model includes:
the predictor is used for predicting the probability that each preset scale of the low-resolution picture is the optimal scale; specifically, the predictor extracts the characteristics of the low-resolution pictures, and the output low-resolution pictures belong to N categories { r }respectively1,r2,...,rNProbability of
Figure FDA0003602567260000031
The above-mentioned
Figure FDA0003602567260000032
Optimal scale factor r representing low resolution pictures1,r2,...,rNAnd (4) realizing the measurement which is the optimal scale probability for each scale factor through the scale factor measurement, wherein the formula is as follows:
Figure FDA0003602567260000033
wherein N selectable scale factors { r ] are preset1,r2,...,rN},
Figure FDA0003602567260000034
Represents a scale factor riIs a low resolution picture xlProbability of best scale truth, diRepresenting the Euclidean distance between each super-resolution picture feature corresponding to the low-resolution picture and the picture feature of the high-resolution search library, wherein gamma is a regulation factor;
the super-resolution module is used for traversing the N scale factors of the low-resolution picture and respectively generating N corresponding super-resolution pictures;
and the re-identification module is used for extracting picture features and calculating the Euclidean distance between each super-resolution picture feature corresponding to the low-resolution picture and the picture feature of the high-resolution search library so as to measure the similarity between the two picture features.
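The re-identification module's distance-based ranking can be sketched as follows (illustrative only, not part of the claims; feature vectors and the helper name are hypothetical):

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Rank high-resolution search library pictures by Euclidean distance
    to the super-resolution query feature; a smaller distance means a
    higher similarity, so that picture appears earlier in the
    identity matching result."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    order = np.argsort(dists)       # indices sorted by ascending distance
    return order, dists[order]

# hypothetical 2-D features: one query and three gallery pictures
query = np.array([1.0, 0.0])
gallery = np.array([[0.9, 0.1],    # very close to the query
                    [5.0, 5.0],    # far away
                    [1.2, -0.2]])  # moderately close
order, ranked_d = rank_gallery(query, gallery)
```

The returned `order` places the most similar gallery picture first, matching the rule in claim 6 that higher similarity means an earlier position in the result.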
8. A terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions; the computer-readable storage medium storing a plurality of instructions adapted to be loaded by the processor to perform the "prediction-recovery-identification"-based low-resolution pedestrian re-identification method of any one of claims 1 to 6.
9. A computer-readable storage medium having a plurality of instructions stored therein, wherein the instructions are adapted to be loaded by a processor of a terminal device to perform the "prediction-recovery-identification"-based low-resolution pedestrian re-identification method of any one of claims 1 to 6.
CN202010843411.2A 2020-08-20 2020-08-20 Low-resolution pedestrian re-identification method and system based on prediction-recovery-identification Active CN111967408B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010843411.2A CN111967408B (en) 2020-08-20 2020-08-20 Low-resolution pedestrian re-identification method and system based on prediction-recovery-identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010843411.2A CN111967408B (en) 2020-08-20 2020-08-20 Low-resolution pedestrian re-identification method and system based on prediction-recovery-identification

Publications (2)

Publication Number Publication Date
CN111967408A CN111967408A (en) 2020-11-20
CN111967408B true CN111967408B (en) 2022-06-21

Family

ID=73388631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010843411.2A Active CN111967408B (en) 2020-08-20 2020-08-20 Low-resolution pedestrian re-identification method and system based on prediction-recovery-identification

Country Status (1)

Country Link
CN (1) CN111967408B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420697B (en) * 2021-07-01 2022-12-09 中科人工智能创新技术研究院(青岛)有限公司 Reloading video pedestrian re-identification method and system based on appearance and shape characteristics
CN114359113A (en) * 2022-03-15 2022-04-15 天津市电子计算机研究所有限公司 Detection method and application system of face image reconstruction and restoration method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993072A (en) * 2019-03-14 2019-07-09 中山大学 Low-resolution pedestrian re-identification system and method based on super-resolution image generation
CN111275171A (en) * 2020-01-19 2020-06-12 合肥工业大学 Small target detection method based on parameter sharing and multi-scale super-resolution reconstruction
CN111461039A (en) * 2020-04-07 2020-07-28 电子科技大学 Landmark identification method based on multi-scale feature fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11651206B2 (en) * 2018-06-27 2023-05-16 International Business Machines Corporation Multiscale feature representations for object recognition and detection

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993072A (en) * 2019-03-14 2019-07-09 中山大学 Low-resolution pedestrian re-identification system and method based on super-resolution image generation
CN111275171A (en) * 2020-01-19 2020-06-12 合肥工业大学 Small target detection method based on parameter sharing and multi-scale super-resolution reconstruction
CN111461039A (en) * 2020-04-07 2020-07-28 电子科技大学 Landmark identification method based on multi-scale feature fusion

Also Published As

Publication number Publication date
CN111967408A (en) 2020-11-20

Similar Documents

Publication Publication Date Title
CN111539370B (en) Image pedestrian re-identification method and system based on multi-attention joint learning
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN110909673B (en) Pedestrian re-identification method based on natural language description
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
Yu et al. High-resolution deep image matting
CN107784288B (en) Iterative positioning type face detection method based on deep neural network
CN106845341B (en) Unlicensed vehicle identification method based on virtual number plate
CN111968150B (en) Weak surveillance video target segmentation method based on full convolution neural network
Bae Object detection based on region decomposition and assembly
CN111178251A (en) Pedestrian attribute identification method and system, storage medium and terminal
CN111967408B (en) Low-resolution pedestrian re-identification method and system based on prediction-recovery-identification
CN112017192B (en) Glandular cell image segmentation method and glandular cell image segmentation system based on improved U-Net network
CN113361432B (en) Video character end-to-end detection and identification method based on deep learning
CN111179270A (en) Image co-segmentation method and device based on attention mechanism
Chen et al. Conditional detr v2: Efficient detection transformer with box queries
CN116311310A (en) Universal form identification method and device combining semantic segmentation and sequence prediction
CN114419406A (en) Image change detection method, training method, device and computer equipment
CN112819011A (en) Method and device for identifying relationships between objects and electronic system
CN112329771A (en) Building material sample identification method based on deep learning
CN114419151A (en) Multi-target tracking method based on contrast learning
CN116258874A (en) SAR recognition database sample gesture expansion method based on depth condition diffusion network
CN116503399B (en) Insulator pollution flashover detection method based on YOLO-AFPS
CN112801029A (en) Multi-task learning method based on attention mechanism
CN112241736A (en) Text detection method and device
CN116168394A (en) Image text recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant