CN111488833A - Pedestrian re-identification method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111488833A
Authority
CN
China
Prior art keywords
pedestrian
identification
network
loss function
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010283986.3A
Other languages
Chinese (zh)
Inventor
张润泽
金良
郭振华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010283986.3A
Publication of CN111488833A
PCT application PCT/CN2021/073655 (WO2021203801A1)
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The application discloses a pedestrian re-identification method and apparatus, an electronic device, and a computer-readable storage medium. The method includes the following steps: acquiring a pedestrian picture and a pedestrian re-identification network, where the network comprises a backbone network, a global feature extraction branch, and a local feature extraction branch; inputting the pedestrian picture into the backbone network to obtain a feature map corresponding to the picture, extracting global features from the feature map with the global feature extraction branch, and extracting local features from the feature map with the local feature extraction branch; and calculating the loss of the global and local features with a loss function, then adjusting the parameters of the pedestrian re-identification network based on that loss to obtain an adjusted network with which pedestrian re-identification is performed. The method improves the transfer performance of the pedestrian re-identification network.

Description

Pedestrian re-identification method and device, electronic equipment and storage medium
Technical Field
The present application relates to the technical field of pedestrian re-identification, and more particularly, to a method and an apparatus for pedestrian re-identification, an electronic device, and a computer-readable storage medium.
Background
Pedestrian Re-identification (ReID) is the task of retrieving, from all other cameras, every picture of a pedestrian of interest observed by one camera. Current ReID research focuses on two major directions: feature learning and metric learning. The goal of metric learning is to map features into a space in which feature vectors of the same person lie closer together while feature vectors of different persons lie farther apart. Feature learning concerns how to design the network so that it learns more discriminative features.
The ReID task faces many challenges: occlusion, viewpoint, illumination, and blur all vary across different cameras, and the low-resolution pictures captured by surveillance cameras add further difficulty. These factors increase intra-class variation and decrease inter-class variation, so distinguishing intra-class from inter-class features is important.
Meanwhile, public ReID datasets contain relatively few pictures, and those pictures differ substantially from the pictures in the ImageNet dataset. Feature maps produced by a backbone network pre-trained on ImageNet therefore generally transfer poorly to ReID datasets.
Therefore, how to improve the transfer performance of a pedestrian re-identification network is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide a pedestrian re-identification method and apparatus, an electronic device, and a computer-readable storage medium that improve the transfer performance of a pedestrian re-identification network.
In order to achieve the above object, the present application provides a pedestrian re-identification method, including:
acquiring a pedestrian picture and a pedestrian re-identification network; the pedestrian re-identification network comprises a backbone network, a global feature extraction branch and a local feature extraction branch;
inputting the pedestrian picture into the backbone network to obtain a feature map corresponding to the pedestrian picture, extracting global features from the feature map by using the global feature extraction branch, and extracting local features from the feature map by using the local feature extraction branch;
and calculating the loss amount of the global features and the local features by using a loss function, and carrying out parameter adjustment on the pedestrian re-identification network based on the loss amount to obtain an adjusted pedestrian re-identification network so as to carry out pedestrian re-identification by using the adjusted pedestrian re-identification network.
Wherein the backbone network comprises a shallow network and a deep network; the shallow network combines a Batch Norm and an Instance Norm, and the deep network uses the Batch Norm only.
Wherein the local feature extraction branch comprises a first local feature extraction branch and a second local feature extraction branch, and the extracting local features from the feature map by using the local feature extraction branch comprises:
dividing the feature map into two parts by using a first local feature extraction branch, and respectively extracting features of the two parts as local features;
and dividing the feature map into three parts by using a second local feature extraction branch, and extracting the features of the three parts respectively as the local features.
Wherein the calculating the loss amount of the global feature and the local feature by using the loss function includes:
determining a first loss function corresponding to the global feature and a second loss function corresponding to the local feature;
and calculating the loss amount of the global feature by using the first loss function, and calculating the loss amount of the local feature by using the second loss function.
Wherein the first loss function comprises a cross-entropy loss function and a triplet loss function, and the second loss function comprises a cross-entropy loss function.
Wherein the triplet loss function is specifically a hard-sample-mining (batch-hard) triplet loss function.
In order to achieve the above object, the present application provides a pedestrian re-identification apparatus, including:
the acquisition module is used for acquiring a pedestrian picture and a pedestrian re-identification network; the pedestrian re-identification network comprises a backbone network, a global feature extraction branch and a local feature extraction branch;
the extraction module is used for inputting the pedestrian picture into the backbone network to obtain a feature map corresponding to the pedestrian picture, extracting global features from the feature map by using the global feature extraction branch, and extracting local features from the feature map by using the local feature extraction branch;
and the adjusting module is used for calculating the loss amount of the global features and the local features by using a loss function, and carrying out parameter adjustment on the pedestrian re-identification network based on the loss amount to obtain an adjusted pedestrian re-identification network so as to carry out pedestrian re-identification by using the adjusted pedestrian re-identification network.
Wherein the adjustment module comprises:
a determining unit, configured to determine a first loss function corresponding to the global feature and a second loss function corresponding to the local feature;
a calculating unit, configured to calculate a loss amount of the global feature by using the first loss function, and calculate a loss amount of the local feature by using the second loss function;
and the adjusting unit is used for carrying out parameter adjustment on the pedestrian re-identification network based on the loss amount to obtain an adjusted pedestrian re-identification network so as to carry out pedestrian re-identification by using the adjusted pedestrian re-identification network.
To achieve the above object, the present application provides an electronic device including:
a memory for storing a computer program;
a processor for implementing the steps of the pedestrian re-identification method as described above when executing the computer program.
To achieve the above object, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the pedestrian re-identification method as described above.
According to the scheme, the pedestrian re-identification method comprises the following steps: acquiring a pedestrian picture and a pedestrian re-identification network; the pedestrian re-identification network comprises a backbone network, a global feature extraction branch and a local feature extraction branch; inputting the pedestrian picture into the backbone network to obtain a feature map corresponding to the pedestrian picture, extracting global features from the feature map by using the global feature extraction branch, and extracting local features from the feature map by using the local feature extraction branch; and calculating the loss amount of the global features and the local features by using a loss function, and carrying out parameter adjustment on the pedestrian re-identification network based on the loss amount to obtain an adjusted pedestrian re-identification network so as to carry out pedestrian re-identification by using the adjusted pedestrian re-identification network.
According to the pedestrian re-identification method described above, the backbone network combines Batch Norm and Instance Norm: because Instance Norm is highly robust to appearance changes, combining it with the conventional Batch Norm gives the pedestrian re-identification network good transfer performance. For feature extraction, global and local features are combined, which improves the discriminability of the features. Therefore, the pedestrian re-identification method provided by the application can effectively transfer a model trained on a large dataset such as ImageNet to a ReID dataset, and can also make effective use of the features extracted by the network. The application also discloses a pedestrian re-identification apparatus, an electronic device, and a computer-readable storage medium that achieve the same technical effects.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow chart illustrating a method of pedestrian re-identification in accordance with an exemplary embodiment;
FIG. 2 is a block diagram illustrating a backbone network in accordance with an exemplary embodiment;
FIG. 3 is a block diagram illustrating a pedestrian re-identification network in accordance with one exemplary embodiment;
FIG. 4 is a block diagram illustrating a pedestrian re-identification arrangement in accordance with one exemplary embodiment;
FIG. 5 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application discloses a pedestrian re-identification method that improves the transfer performance of a pedestrian re-identification network.
Referring to fig. 1, a flowchart of a pedestrian re-identification method according to an exemplary embodiment is shown, as shown in fig. 1, including:
s101: acquiring a pedestrian picture and a pedestrian re-identification network; the pedestrian re-identification network comprises a backbone network, a global feature extraction branch and a local feature extraction branch;
in this embodiment, the pedestrian picture set may be drawn from a large dataset such as ImageNet. The pedestrian re-recognition network is used to extract features of the pedestrian picture: the backbone network extracts the features, which then pass through GAP (Global Average Pooling) and a fully connected layer and are trained with a loss function.
In a specific implementation, the backbone network includes a shallow network and a deep network; the shallow network combines Batch Norm (BN) and Instance Norm (IN), and the deep network uses Batch Norm only. Batch Norm addresses the difficulty of propagating signals through very deep neural networks: the outputs of each layer have different means and variances, so their distributions differ, and Batch Norm thereby increases the difference between samples. Because the ReID data captured by different cameras varies in viewpoint, illumination, blur, and other conditions, features that are invariant to these factors are desirable: the model output should not be affected by changes in brightness, color, and similar factors across the dataset. Instance Norm preserves the appearance invariance of the feature map well. This embodiment therefore combines Batch Norm and Instance Norm in the shallow layers of the backbone network and uses only Batch Norm in the deep layers, which preserves rich semantic information while filtering out appearance factors such as brightness and color.
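As an illustrative sketch (not necessarily the patented implementation itself), one common way to merge BN and IN inside a shallow residual block, in the spirit of the IBN-Net design, is to normalize part of the channels with Instance Norm and the rest with Batch Norm; the half-and-half channel split, the module name, and the affine settings below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class IBN(nn.Module):
    """Normalize part of the channels with InstanceNorm and the rest with BatchNorm."""
    def __init__(self, planes, ratio=0.5):
        super().__init__()
        self.half = int(planes * ratio)          # channels handled by Instance Norm
        self.IN = nn.InstanceNorm2d(self.half, affine=True)
        self.BN = nn.BatchNorm2d(planes - self.half)

    def forward(self, x):
        # Split along the channel dimension, normalize each part, and re-concatenate.
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.IN(a), self.BN(b)], dim=1)
```

In a Resnet-style backbone, a module of this kind would replace the normalization layer only in the shallow stages (Stage1 and Stage2), matching the shallow/deep split described above.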
The specific network structure can follow Resnet. Taking Resnet50 as an example, the stem network is the same as that of Resnet50: a 7 × 7 convolution with stride 2, Batch Norm, ReLU, and max pooling with stride 2. Four stages follow, containing 3, 4, 6, and 3 residual blocks respectively. The specific difference is that the residual blocks used in Stage1 and Stage2 adopt the structure shown in fig. 2, while the residual blocks in Stage3 and Stage4 are consistent with Resnet50.
S102: inputting the pedestrian picture into the backbone network to obtain a feature map corresponding to the pedestrian picture, extracting global features from the feature map by using the global feature extraction branch, and extracting local features from the feature map by using the local feature extraction branch;
in this step, the backbone network is used to extract a feature map corresponding to the pedestrian picture, and the global feature extraction branch and the local feature extraction branch are respectively used to extract global features and local features. Specifically, the local feature extraction branch includes a first local feature extraction branch and a second local feature extraction branch, and the step of extracting the local feature from the feature map by using the local feature extraction branch includes: dividing the feature map into two parts by using a first local feature extraction branch, and respectively extracting features of the two parts as local features; and dividing the feature map into three parts by using a second local feature extraction branch, and extracting the features of the three parts respectively as the local features.
As shown in fig. 3, the shallow part of the backbone network includes the stem network, the stage1 network, and the stage2 network (corresponding to the three cubes, from large to small, in the shallow network in fig. 3); stages 1 and 2 contain a normalization layer that merges IN and BN, while the deep network contains only BN layers. Feature expression combines global and local features: the uppermost branch in the figure is the global feature extraction branch, and the two lower branches are the local feature extraction branches. The three branches use similar network structures, all consistent with stages 3 and 4 of Resnet50. The difference is that stage4 of the global feature extraction branch uses stride-2 downsampling followed by GAP, whereas stage4 of the local feature extraction branches does not downsample the feature map, in order to preserve as much local detail as possible, and applies GAP in the horizontal direction: the first local feature extraction branch divides the feature map into two parts, and the second local feature extraction branch divides it into three parts.
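The horizontal-direction GAP over stripes described above can be sketched as follows; the tensor sizes and the function name are assumptions for illustration, not values taken from the patent:

```python
import torch
import torch.nn.functional as F

def horizontal_parts(feat, num_parts):
    """Split a feature map into horizontal stripes and average-pool each stripe.

    feat: (N, C, H, W) tensor; returns a list of num_parts (N, C) feature vectors.
    """
    stripes = torch.chunk(feat, num_parts, dim=2)  # split along the height axis
    return [F.adaptive_avg_pool2d(s, 1).flatten(1) for s in stripes]

feat = torch.randn(4, 2048, 24, 8)       # assumed local-branch output size
two_parts = horizontal_parts(feat, 2)    # first local feature extraction branch
three_parts = horizontal_parts(feat, 3)  # second local feature extraction branch
```

Each pooled stripe vector then feeds its own classifier head, as in part-based ReID models.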
S103: and calculating the loss amount of the global features and the local features by using a loss function, and carrying out parameter adjustment on the pedestrian re-identification network based on the loss amount to obtain an adjusted pedestrian re-identification network so as to carry out pedestrian re-identification by using the adjusted pedestrian re-identification network.
In this step, different loss functions are used for the global feature and the local feature, that is, this step may include: determining a first loss function corresponding to the global feature and a second loss function corresponding to the local feature; and calculating the loss amount of the global feature by using the first loss function, and calculating the loss amount of the local feature by using the second loss function.
In fig. 3, the global feature uses a combination of a cross-entropy loss function (Softmax Loss) and a triplet loss function (Triplet Loss), while the local features are trained with the cross-entropy loss (Softmax Loss) only. That is, the first loss function in this step comprises a cross-entropy loss function and a triplet loss function, and the second loss function comprises a cross-entropy loss function; the triplet loss may specifically be a hard-sample-mining (batch-hard) triplet loss function.
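A minimal sketch of a batch-hard (hard-sample-mining) triplet loss of the kind referred to here, following the common formulation of Hermans et al.; the margin value and function name are assumptions:

```python
import torch

def batch_hard_triplet_loss(features, labels, margin=0.3):
    """Batch-hard triplet loss: for each anchor, use the farthest positive
    and the closest negative within the batch, with a margin-based hinge."""
    dist = torch.cdist(features, features)            # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1) # same-identity mask
    pos = dist.clone()
    pos[~same] = 0.0                                  # ignore negatives
    hardest_pos = pos.max(dim=1).values               # farthest same-identity sample
    neg = dist.clone()
    neg[same] = float('inf')                          # ignore positives (and self)
    hardest_neg = neg.min(dim=1).values               # closest different identity
    return torch.clamp(hardest_pos - hardest_neg + margin, min=0).mean()
```

In training, this loss would be computed on the global feature vectors and added to the cross-entropy loss of the classifier head.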
According to the pedestrian re-identification method described above, the backbone network combines Batch Norm and Instance Norm: because Instance Norm is highly robust to appearance changes, combining it with the conventional Batch Norm gives the pedestrian re-identification network good transfer performance. For feature extraction, global and local features are combined, which improves the discriminability of the features. Therefore, the pedestrian re-identification method provided by the embodiment of the application can effectively transfer a model trained on a large dataset such as ImageNet to a ReID dataset, and can effectively use the features extracted by the network.
In the following, an application embodiment provided by the present application is described. The pedestrian re-identification network is implemented with the current mainstream deep learning framework PyTorch, the deep learning model library torchvision, and the mainstream ReID algorithm library torchreid.
Experiments were run on a single V100 GPU. Two databases were used: the Market1501 dataset and the NAIC2019 pedestrian re-identification preliminary-round competition dataset. Market1501 contains 751 identities and 12,936 training images (on average 17.2 training images per person); its test set contains 750 identities and 19,732 images in total (on average 26.3 test images per person). The NAIC2019 competition dataset contains 20,429 training images; its test set has 1,349 query images and 5,366 gallery images. The models used are Resnet50, se_Resnet50, and se_Resnet101, with a batch size of 64 and an initial learning rate of 0.00035; horizontal flipping and random erasing were used for data augmentation during training, and the input image resolution is 384 × 128.
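The reported augmentation pipeline (horizontal flipping, random erasing, 384 × 128 input) could be recreated with torchvision roughly as follows; the flip probability and erasing parameters are assumptions, since the text does not state them:

```python
import torchvision.transforms as T

# Hypothetical recreation of the training-time augmentation described above.
train_transform = T.Compose([
    T.Resize((384, 128)),            # input resolution taken from the text
    T.RandomHorizontalFlip(p=0.5),   # assumed flip probability
    T.ToTensor(),
    T.RandomErasing(),               # operates on tensors, so placed after ToTensor
])
```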
The results on Market1501 are as follows, using the Rank-1 metric:
Method            Resnet50   Se_Resnet50   Se_Resnet101
PCB               93.8       94.2          94.5
Luo baseline      94.8       95.3          95.5
This application  95.4       95.8          96.3
The results on the NAIC dataset are as follows, using the mean of Rank-1 and mAP:
Method            Resnet50   Se_Resnet50   Se_Resnet101
PCB               80.8       81.1          82.3
Luo baseline      82.0       82.4          83.3
This application  84.6       85.2          86.2
The two groups of experiments compare the algorithms on the Market1501 dataset and the NAIC competition dataset respectively. PCB is a network that uses only local features, while the Luo baseline achieves strong results through training tricks; the performance of the method provided by this application is clearly superior to both.
In the following, a pedestrian re-identification device provided by an embodiment of the present application is introduced, and a pedestrian re-identification device described below and a pedestrian re-identification method described above may be referred to each other.
Referring to fig. 4, a block diagram of a pedestrian re-identification apparatus according to an exemplary embodiment is shown, as shown in fig. 4, including:
an obtaining module 401, configured to obtain a pedestrian picture and a pedestrian re-identification network; the pedestrian re-identification network comprises a backbone network, a global feature extraction branch and a local feature extraction branch;
an extracting module 402, configured to input the pedestrian picture into the backbone network to obtain a feature map corresponding to the pedestrian picture, extract global features from the feature map by using the global feature extracting branch, and extract local features from the feature map by using the local feature extracting branch;
an adjusting module 403, configured to calculate loss amounts of the global feature and the local feature by using a loss function, and perform parameter adjustment on the pedestrian re-identification network based on the loss amounts to obtain an adjusted pedestrian re-identification network, so as to perform pedestrian re-identification by using the adjusted pedestrian re-identification network.
According to the pedestrian re-identification apparatus provided by the embodiment of the application, the backbone network combines Batch Norm and Instance Norm: because Instance Norm is highly robust to appearance changes, combining it with the conventional Batch Norm gives the pedestrian re-identification network good transfer performance. For feature extraction, global and local features are combined, which improves the discriminability of the features. Therefore, the apparatus can effectively transfer a model trained on a large dataset such as ImageNet to a ReID dataset, and can effectively use the features extracted by the network.
On the basis of the above embodiment, as a preferred implementation, the backbone network includes a shallow network and a deep network; the shallow network combines a Batch Norm and an Instance Norm, and the deep network uses the Batch Norm only.
On the basis of the foregoing embodiment, as a preferred implementation manner, the local feature extraction branch includes a first local feature extraction branch and a second local feature extraction branch, and the extraction module 402 includes:
an input unit, configured to input the pedestrian picture into the backbone network to obtain the feature map corresponding to the pedestrian picture;
a first extraction unit configured to divide the feature map into two parts by using a first local feature extraction branch, and extract features of the two parts as the local features, respectively;
and the second extraction unit is used for dividing the feature map into three parts by using a second local feature extraction branch and respectively extracting the features of the three parts as the local features.
On the basis of the foregoing embodiment, as a preferred implementation, the adjusting module 403 includes:
a determining unit, configured to determine a first loss function corresponding to the global feature and a second loss function corresponding to the local feature;
a calculating unit, configured to calculate a loss amount of the global feature by using the first loss function, and calculate a loss amount of the local feature by using the second loss function;
and the adjusting unit is used for carrying out parameter adjustment on the pedestrian re-identification network based on the loss amount to obtain an adjusted pedestrian re-identification network so as to carry out pedestrian re-identification by using the adjusted pedestrian re-identification network.
On the basis of the foregoing embodiment, as a preferred implementation, the first loss function includes a cross entropy loss function and a triplet loss function, and the second loss function includes a cross entropy loss function.
Based on the above embodiment, as a preferred implementation, the triplet loss function is specifically a hard-sample-mining (batch-hard) triplet loss function.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present application further provides an electronic device. Referring to fig. 5, a structural diagram of an electronic device 500 provided in an embodiment of the present application is shown. As shown in fig. 5, the electronic device 500 may include a processor 11 and a memory 12, and may also include one or more of a multimedia component 13, an input/output (I/O) interface 14, and a communication component 15.
The processor 11 is configured to control the overall operation of the electronic device 500, so as to complete all or part of the steps in the above-mentioned pedestrian re-identification method. The memory 12 is used to store various types of data to support operation at the electronic device 500, such as instructions for any application or method operating on the electronic device 500, and application-related data, such as contact data, messaging, pictures, audio, video, and so forth. The Memory 12 may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk or optical disk. The multimedia component 13 may include a screen and an audio component. Wherein the screen may be, for example, a touch screen and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 12 or transmitted via the communication component 15. The audio assembly also includes at least one speaker for outputting audio signals. The I/O interface 14 provides an interface between the processor 11 and other interface modules, such as a keyboard, mouse, buttons, etc. These buttons may be virtual buttons or physical buttons. The communication component 15 is used for wired or wireless communication between the electronic device 500 and other devices. Wireless communication, such as Wi-Fi, bluetooth, Near Field Communication (NFC), 2G, 3G or 4G, or a combination of one or more of them, so that the corresponding communication component 15 may include: Wi-Fi module, bluetooth module, NFC module.
In an exemplary embodiment, the electronic device 500 may be implemented by one or more Application Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Digital Signal Processing Devices (DSPD), Programmable Logic Devices (PLD), Field Programmable Gate Arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above-mentioned pedestrian re-identification method.
In another exemplary embodiment, there is also provided a computer-readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the above-described pedestrian re-identification method. For example, the computer-readable storage medium may be the memory 12 described above, including program instructions executable by the processor 11 of the electronic device 500 to perform the pedestrian re-identification method described above.
The embodiments in the specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that the embodiments have in common, reference may be made to one another. Since the device disclosed by an embodiment corresponds to the method disclosed by an embodiment, its description is brief, and for the relevant points reference may be made to the description of the method. It should be noted that those skilled in the art can make several improvements and modifications to the present application without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

Claims (10)

1. A pedestrian re-identification method is characterized by comprising the following steps:
acquiring a pedestrian picture and a pedestrian re-identification network; the pedestrian re-identification network comprises a backbone network, a global feature extraction branch and a local feature extraction branch;
inputting the pedestrian picture into the backbone network to obtain a feature map corresponding to the pedestrian picture, extracting global features from the feature map by using the global feature extraction branch, and extracting local features from the feature map by using the local feature extraction branch;
and calculating the loss amount of the global features and the local features by using a loss function, and carrying out parameter adjustment on the pedestrian re-identification network based on the loss amount to obtain an adjusted pedestrian re-identification network so as to carry out pedestrian re-identification by using the adjusted pedestrian re-identification network.
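For illustration, the data flow recited in claim 1 can be sketched in NumPy. Everything below is a hypothetical stand-in: the claim fixes no backbone architecture, pooling scheme, or concrete loss, so the random "backbone", the average pooling, and the squared-error placeholder loss are assumptions, and the backpropagation-based parameter adjustment is represented only by the computed loss amount.

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone(picture):
    # Hypothetical stand-in for the backbone network: a pedestrian picture
    # (H, W, 3) is mapped to a feature map (C, h, w).
    return rng.standard_normal((8, 6, 4))

def global_branch(fmap):
    # Global feature extraction branch: pool the whole map into one vector.
    return fmap.mean(axis=(1, 2))

def local_branch(fmap, parts):
    # Local feature extraction branch: pool each horizontal stripe separately.
    return [s.mean(axis=(1, 2)) for s in np.array_split(fmap, parts, axis=1)]

def loss_amount(feature, target):
    # Squared-error placeholder for the cross-entropy / triplet losses
    # described in claims 4 to 6.
    return float(np.mean((feature - target) ** 2))

picture = rng.standard_normal((128, 64, 3))   # pedestrian picture
fmap = backbone(picture)
total = loss_amount(global_branch(fmap), 0.0) + sum(
    loss_amount(f, 0.0) for f in local_branch(fmap, 2))
print(round(total, 3))  # loss amount that would drive parameter adjustment
```

A real implementation would compute gradients of this total loss and update the network parameters; only the forward data flow of the claim is shown here.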
2. The pedestrian re-identification method according to claim 1, wherein the backbone network includes a shallow network and a deep network; the shallow network is a network combining Batch Norm and Instance Norm, and the deep network uses Batch Norm.
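The combination of Batch Norm and Instance Norm in the shallow layers of claim 2 resembles IBN-style normalization. The claim does not state how channels are divided between the two normalizers, so the half-and-half channel split below is an assumption; a minimal NumPy sketch over an (N, C, H, W) tensor:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Batch Norm: normalize each channel over the batch and spatial dims (0, 2, 3).
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    # Instance Norm: normalize each sample's each channel over spatial dims only.
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def shallow_layer(x):
    # Assumed IBN-style split: half the channels through Instance Norm,
    # half through Batch Norm, then concatenated along the channel axis.
    c = x.shape[1] // 2
    return np.concatenate([instance_norm(x[:, :c]), batch_norm(x[:, c:])], axis=1)

x = np.random.default_rng(0).standard_normal((4, 8, 6, 6))
y = shallow_layer(x)   # deep layers would instead apply batch_norm alone
print(y.shape)
```

Instance Norm removes per-image appearance statistics (helpful against illumination and style variation across cameras), while Batch Norm preserves discriminative content, which is the usual motivation for mixing them in shallow layers only.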
3. The pedestrian re-recognition method according to claim 1, wherein the local feature extraction branch includes a first local feature extraction branch and a second local feature extraction branch, and the extracting local features from the feature map using the local feature extraction branch includes:
dividing the feature map into two parts by using a first local feature extraction branch, and respectively extracting features of the two parts as local features;
and dividing the feature map into three parts by using a second local feature extraction branch, and extracting the features of the three parts respectively as the local features.
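The two-part and three-part division recited in claim 3 can be sketched as horizontal striping of the feature map followed by per-stripe pooling. The split along the height axis and the average pooling are assumptions (the claim only says "dividing into parts" and "extracting the features of the parts"):

```python
import numpy as np

def part_features(fmap, num_parts):
    """Split a (C, H, W) feature map into num_parts horizontal stripes and
    average-pool each stripe into one C-dimensional local feature."""
    stripes = np.array_split(fmap, num_parts, axis=1)  # split along height H
    return [s.mean(axis=(1, 2)) for s in stripes]

fmap = np.arange(2 * 6 * 4, dtype=float).reshape(2, 6, 4)
two = part_features(fmap, 2)    # first local branch: two parts
three = part_features(fmap, 3)  # second local branch: three parts
print(len(two), two[0].shape, len(three))
```

Running both branches on the same feature map yields local descriptors at two granularities, which is the point of having two separate local feature extraction branches.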
4. The pedestrian re-identification method according to any one of claims 1 to 3, wherein the calculating the loss amount of the global feature and the local feature using a loss function includes:
determining a first loss function corresponding to the global feature and a second loss function corresponding to the local feature;
and calculating the loss amount of the global feature by using the first loss function, and calculating the loss amount of the local feature by using the second loss function.
5. The pedestrian re-identification method of claim 4, wherein the first loss function comprises a cross entropy loss function and a triplet loss function, and the second loss function comprises a cross entropy loss function.
6. The pedestrian re-identification method of claim 5, wherein the triplet loss function is embodied as a triplet loss function with hard sample mining.
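The losses of claims 5 and 6 can be sketched as follows: a softmax cross-entropy on classification logits, and a "batch hard" triplet loss that, for each anchor, mines the farthest positive and nearest negative within the batch. The margin value 0.3 and the Euclidean distance are assumptions not fixed by the claims:

```python
import numpy as np

def cross_entropy(logits, label):
    # Softmax cross-entropy for one sample (used by both loss functions of claim 5).
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def batch_hard_triplet(features, labels, margin=0.3):
    """Triplet loss with hard sample mining (claim 6): per anchor, take the
    farthest same-identity sample and the nearest different-identity sample."""
    d = np.linalg.norm(features[:, None] - features[None, :], axis=2)
    losses = []
    for i, y in enumerate(labels):
        pos = d[i][(labels == y) & (np.arange(len(labels)) != i)]
        neg = d[i][labels != y]
        losses.append(max(0.0, pos.max() - neg.min() + margin))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))                 # batch of feature vectors
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])          # pedestrian identities
print(round(cross_entropy(rng.standard_normal(10), 3), 3))
print(round(batch_hard_triplet(feats, labels), 3))
```

When same-identity features already cluster tightly and identities are separated by more than the margin, the mined triplet loss drops to zero, so training pressure concentrates on the hardest remaining examples.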
7. A pedestrian re-recognition apparatus, comprising:
the acquisition module is used for acquiring a pedestrian picture and a pedestrian re-identification network; the pedestrian re-identification network comprises a backbone network, a global feature extraction branch and a local feature extraction branch;
the extraction module is used for inputting the pedestrian picture into the backbone network to obtain a feature map corresponding to the pedestrian picture, extracting global features from the feature map by using the global feature extraction branch, and extracting local features from the feature map by using the local feature extraction branch;
and the adjusting module is used for calculating the loss amount of the global features and the local features by using a loss function, and carrying out parameter adjustment on the pedestrian re-identification network based on the loss amount to obtain an adjusted pedestrian re-identification network so as to carry out pedestrian re-identification by using the adjusted pedestrian re-identification network.
8. The pedestrian re-identification apparatus of claim 7, wherein the adjustment module comprises:
a determining unit, configured to determine a first loss function corresponding to the global feature and a second loss function corresponding to the local feature;
a calculating unit, configured to calculate a loss amount of the global feature by using the first loss function, and calculate a loss amount of the local feature by using the second loss function;
and the adjusting unit is used for carrying out parameter adjustment on the pedestrian re-identification network based on the loss amount to obtain an adjusted pedestrian re-identification network so as to carry out pedestrian re-identification by using the adjusted pedestrian re-identification network.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the pedestrian re-identification method according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the pedestrian re-identification method according to any one of claims 1 to 6.
CN202010283986.3A 2020-04-08 2020-04-08 Pedestrian re-identification method and device, electronic equipment and storage medium Withdrawn CN111488833A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010283986.3A CN111488833A (en) 2020-04-08 2020-04-08 Pedestrian re-identification method and device, electronic equipment and storage medium
PCT/CN2021/073655 WO2021203801A1 (en) 2020-04-08 2021-01-25 Person re-identification method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN111488833A (en) 2020-08-04

Family

ID=71794878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010283986.3A Withdrawn CN111488833A (en) 2020-04-08 2020-04-08 Pedestrian re-identification method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111488833A (en)
WO (1) WO2021203801A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332955B (en) * 2022-03-11 2022-06-10 浪潮云信息技术股份公司 Pedestrian re-identification method and device and computer readable storage medium
CN115147774B (en) * 2022-07-05 2024-04-02 中国科学技术大学 Pedestrian re-identification method based on characteristic alignment in degradation environment
CN115147871A (en) * 2022-07-19 2022-10-04 北京龙智数科科技服务有限公司 Pedestrian re-identification method under shielding environment
CN116110076B (en) * 2023-02-09 2023-11-07 国网江苏省电力有限公司苏州供电分公司 Power transmission aerial work personnel identity re-identification method and system based on mixed granularity network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508663B (en) * 2018-10-31 2021-07-13 上海交通大学 Pedestrian re-identification method based on multi-level supervision network
CN109784186B (en) * 2018-12-18 2020-12-15 深圳云天励飞技术有限公司 Pedestrian re-identification method and device, electronic equipment and computer-readable storage medium
CN109784258A (en) * 2019-01-08 2019-05-21 华南理工大学 A kind of pedestrian's recognition methods again cut and merged based on Analysis On Multi-scale Features
CN110110642B (en) * 2019-04-29 2020-12-22 华南理工大学 Pedestrian re-identification method based on multi-channel attention features
CN111488833A (en) * 2020-04-08 2020-08-04 苏州浪潮智能科技有限公司 Pedestrian re-identification method and device, electronic equipment and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021203801A1 (en) * 2020-04-08 2021-10-14 苏州浪潮智能科技有限公司 Person re-identification method and apparatus, electronic device, and storage medium
CN112801235A (en) * 2021-04-12 2021-05-14 四川大学 Model training method, prediction device, re-recognition model and electronic equipment
CN113191461A (en) * 2021-06-29 2021-07-30 苏州浪潮智能科技有限公司 Picture identification method, device and equipment and readable storage medium
CN113191461B (en) * 2021-06-29 2021-09-17 苏州浪潮智能科技有限公司 Picture identification method, device and equipment and readable storage medium
WO2023272995A1 (en) * 2021-06-29 2023-01-05 苏州浪潮智能科技有限公司 Person re-identification method and apparatus, device, and readable storage medium
US11830275B1 (en) 2021-06-29 2023-11-28 Inspur Suzhou Intelligent Technology Co., Ltd. Person re-identification method and apparatus, device, and readable storage medium
CN117809289A (en) * 2024-02-29 2024-04-02 东北大学 Pedestrian detection method for traffic scene

Also Published As

Publication number Publication date
WO2021203801A1 (en) 2021-10-14

Similar Documents

Publication Publication Date Title
CN111488833A (en) Pedestrian re-identification method and device, electronic equipment and storage medium
US11417148B2 (en) Human face image classification method and apparatus, and server
CN111401265B (en) Pedestrian re-identification method and device, electronic equipment and computer-readable storage medium
CN107316035A (en) Object identifying method and device based on deep learning neutral net
CN114444558A (en) Training method and training device for neural network for object recognition
CN112819686B (en) Image style processing method and device based on artificial intelligence and electronic equipment
CN113590881B (en) Video clip retrieval method, training method and device for video clip retrieval model
CN112381104A (en) Image identification method and device, computer equipment and storage medium
CN113239875B (en) Method, system and device for acquiring face characteristics and computer readable storage medium
CN111027450A (en) Bank card information identification method and device, computer equipment and storage medium
CN112084927A (en) Lip language identification method fusing multiple visual information
CN112560710B (en) Method for constructing finger vein recognition system and finger vein recognition system
CN112836625A (en) Face living body detection method and device and electronic equipment
CN109961403B (en) Photo adjusting method and device, storage medium and electronic equipment
Liu et al. Iris recognition in visible spectrum based on multi-layer analogous convolution and collaborative representation
CN112581355A (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN111507416A (en) Smoking behavior real-time detection method based on deep learning
US10535192B2 (en) System and method for generating a customized augmented reality environment to a user
CN116740362B (en) Attention-based lightweight asymmetric scene semantic segmentation method and system
CN110689066B (en) Training method combining face recognition data equalization and enhancement
TWI803243B (en) Method for expanding images, computer device and storage medium
CN111753583A (en) Identification method and device
CN114283087A (en) Image denoising method and related equipment
CN110489592B (en) Video classification method, apparatus, computer device and storage medium
CN114821691A (en) Training method and device for face living body detection network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200804