CN111832348A - Pedestrian re-identification method based on pixel and channel attention mechanism


Info

Publication number
CN111832348A
CN111832348A (application CN201910310802.5A; granted as CN111832348B)
Authority
CN
China
Prior art keywords
channel
pedestrian
pixel
information
features
Prior art date
Legal status
Granted
Application number
CN201910310802.5A
Other languages
Chinese (zh)
Other versions
CN111832348B (en)
Inventor
王敏杰
李现�
张加焕
肖江剑
Current Assignee
Ningbo Institute of Material Technology and Engineering of CAS
Original Assignee
Ningbo Institute of Material Technology and Engineering of CAS
Priority date
Filing date
Publication date
Application filed by Ningbo Institute of Material Technology and Engineering of CAS filed Critical Ningbo Institute of Material Technology and Engineering of CAS
Priority to CN201910310802.5A
Publication of CN111832348A
Application granted
Publication of CN111832348B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian re-identification method based on a pixel and channel attention mechanism, comprising the following steps: extracting the global features of a pedestrian from the person's bounding box (search box); dividing the pedestrian picture evenly into two parts and into three parts, and extracting the pedestrian's local features from each; and matching the extracted person features against the person information in the Gallery to find the required person information. The method extracts features with channel and pixel attention modules, which effectively reduces the influence of background information on the retrieval result. Middle-layer supervision is further designed into the neural network: during feature extraction, a multi-loss function supervises the middle-layer feature information and accelerates network convergence. The resulting pedestrian re-identification network, built on the channel attention mechanism, the pixel attention mechanism and middle-layer supervision, effectively removes redundant information from the person bounding box, so that person information is effectively aggregated and retrieval precision is markedly improved.

Description

Pedestrian re-identification method based on pixel and channel attention mechanism
Technical Field
The invention relates to a pedestrian re-identification method, in particular to a pedestrian re-identification method based on a pixel and channel attention mechanism, and belongs to the technical field of image processing.
Background
At present, criminal activity at home and abroad poses a serious threat to the stable, sustainable development of society. Places with heavy foot traffic, such as shopping malls, stations, airports and pedestrian streets, are covered by surveillance equipment of every size, yet accurately finding the person or information one needs in that surveillance footage remains a great challenge. In criminal investigation work in particular, police must locate a suspect in a large volume of long-duration surveillance footage in order to learn the situation in time and bring the suspect under control. The footage, however, is enormous in quantity and complicated in content, and each camera covers only a narrow field of view, which makes finding the target person quickly and accurately very difficult. Face recognition technology is now mature and widely applied in many fields, but in surveillance video the camera's resolution and shooting angle often prevent the capture of a clear, usable face image, so person information cannot be retrieved by face recognition alone.
To solve the problem of person retrieval under such complex conditions, person re-identification technology, also called pedestrian re-identification, has emerged. The technology retrieves person information by computer and can save a great deal of manpower and material resources. With the development of deep learning, deep-learning-based methods have become the mainstream of pedestrian re-identification. Existing deep-learning-based re-identification methods fall mainly into five categories: methods based on representation learning, on metric learning, on local features, on video sequences, and on GAN-generated images.
These methods are widely used in person re-identification research, but each has problems. Representation-learning methods use global features as the feature vector, so many detail features are lost during extraction and errors appear in the retrieval results. Metric-learning methods compare the similarity distance between two pictures through a neural network, and how to compute inter-picture similarity accurately is still an open research topic. Local-feature methods divide the person picture into several parts in the vertical direction and extract local features from each part; however, because of the person's pose and similar factors, the division is often inaccurate, which seriously degrades system accuracy. Video-sequence-based re-identification still needs further work on removing redundant frames. And at present, the pictures generated by GAN-based methods can only serve as negative samples and suffer from fairly severe distortion.
Beyond the drawbacks of the individual methods, low camera resolution, occlusion, and variations in viewpoint, pose and illumination all harm a re-identification system. Current deep-learning-based pedestrian re-identification methods use pooling to reduce the dimensionality of extracted features, but whether maximum or average pooling is used, every channel and every pixel of the picture is treated identically. In particular, a bounding box (search box) contains both person information and background information, which the neural network cannot distinguish, so the background is extracted as part of the person's features and can strongly degrade the accuracy of the whole re-identification system. Effectively reducing the influence of background information on re-identification is therefore a major challenge.
To effectively reduce the influence of background information on the retrieval result, the invention proposes extracting features with channel and pixel attention modules. Before maximum pooling and average pooling, the channel and pixel attention modules are applied to remove redundant information and improve the effectiveness of the picture feature vector. The invention also extracts the global and local features of the pedestrian with a neural network, designs middle-layer supervision for that network, and uses a multi-loss function to supervise the middle-layer feature information during extraction, which speeds network convergence and improves retrieval precision.
Disclosure of Invention
The invention mainly aims to provide a pedestrian re-identification method based on a pixel and channel attention mechanism so as to overcome the defects in the prior art.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
extracting the global features of the pedestrian from the person's bounding box (search box);
dividing the pedestrian picture evenly into two parts and into three parts, and extracting the pedestrian's local features from each;
and matching the extracted person features against the person information in the Gallery to find the required person information.
Preferably, global and local features of the pedestrian are extracted based on the neural network, the extracted global features of the pedestrian include color and edge features, and the extracted local features of the pedestrian include color and edge features of different regions of the pedestrian in the vertical direction.
Preferably, in the process of extracting the local features of the pedestrian, the channel attention module and the pixel attention module are used for aggregating the character feature information, the extracted character features are character feature information obtained by feature aggregation through a neural network, and the character information in the Gallery is character feature information output after pictures in the Gallery are input into a trained model.
Preferably, the extracting global and local features of the pedestrian based on the neural network specifically includes:
a ResNet-50 network is used as the base network for extracting picture features, of which only the first three layers are used; the whole network is then divided into three branches: the first branch extracts the global features of the image, the second branch divides the feature tensor into two parts in the vertical direction, and the third branch divides it into three parts in the vertical direction; the channel attention module then aggregates the feature information and removes redundant channel information; maximum pooling reduces the dimensionality; finally, a 1 × 1 convolutional layer reduces the feature-vector dimension from 2048 to 256;
middle-layer supervision is attached after the first three layers of the ResNet network, i.e. layer1, layer2 and layer3, where pixel attention modules are used to reduce the value of background pixels and increase the value of person pixels.
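As an illustration of the pooling and 1 × 1-convolution step above, the following NumPy sketch (the function name, the 2 × 2 pooling window and the weight shapes are assumptions not fixed by the text) max-pools a feature tensor and maps 2048 channels down to 256:

```python
import numpy as np

def reduce_features(feat, weights):
    """Max-pool an (H, W, C) tensor with a 2x2 window, then apply a 1x1
    convolution that maps C = 2048 channels to 256 (weights: (2048, 256))."""
    h, wd, c = feat.shape
    # 2x2 max pooling: group pixels into 2x2 blocks and take the block maximum
    pooled = feat.reshape(h // 2, 2, wd // 2, 2, c).max(axis=(1, 3))
    # a 1x1 convolution is a per-pixel matrix multiply over the channel axis
    return pooled @ weights
```

A 1 × 1 convolution changes only the channel count, which is why it reduces to a matrix product here.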
Preferably, the channel attention module is implemented as follows:
let the size of the input tensor be H × W × C, denoted X = [x_1, x_2, …, x_C], wherein H represents the height of the image, W the width of the image, and C the channels;
the first step: reduce the dimension of each channel's feature information; the feature of channel c after dimension reduction is represented by F_c:

F_c = (1/(H·W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j)   (1)

wherein x_c(i, j) is the value at position (i, j) on channel c; the formula averages the tensor within each channel, which achieves feature aggregation;
the second step: filter each channel with a filter and delete redundant information:

ω_c = f_1(F_c)   (2)

wherein ω_c represents the weight given to each channel, F_c represents the tensor value of channel c, and f_1 represents the filtering operation;
the third step: perform the dimension-raising operation:

Z_c = f_2(ω_c)   (3)

wherein ω_c is the weight of each channel, Z_c is the final weight of each channel, and f_2 is the dimension-raising function, representing a convolution operation;
the fourth step: weight the source tensor:

X_result-c(i, j) = Z_c · x_c(i, j)   (4).
preferably, the pixel attention module is implemented as follows:
let the size of the input tensor be H × W × C, denoted Y = [y_1, y_2, …, y_C], wherein H represents the height of the image, W the width of the image, and C the channels;
the first step: compress the number of channels to 1 according to the following formula (5) for subsequent processing:

D(i, j) = (1/C) · Σ_{c=1}^{C} y_c(i, j)   (5)

the second step: rearrange the tensor values:

E_α = g_0(D), α = 3·j + i   (6)

the third step: perform screening:

{I_1, I_2, …, I_C} = g_1({η_1, η_2, …, η_α} · {E_1, E_2, …, E_α})   (7)
{J_1, J_2, …, J_α} = g_2({γ_1, γ_2, …, γ_N} · {I_1, I_2, …, I_N})   (8)

the fourth step: restore the obtained vector to the original map size:

K = g_4(J)   (9)

the fifth step: assign a weight to each pixel:

Y_result-c(i, j) = K(i, j) · Y(i, j)   (10).
compared with the prior art, the invention has the advantages that:
(1) features are extracted with the channel and pixel attention modules: before the maximum-pooling and average-pooling operations, the channel and pixel attention modules are applied to remove redundant information and improve the effectiveness of the picture feature vector; meanwhile, the invention extracts the global and local features of the pedestrian with a neural network, designs middle-layer supervision for that network, and uses a multi-loss function to supervise the middle-layer feature information during extraction, which speeds network convergence and improves retrieval precision;
(2) the invention provides an innovative pedestrian re-identification network based on a channel attention mechanism, a pixel attention mechanism and intermediate layer supervision. The network can effectively delete redundant information in the character bounding box, so that the character information is effectively aggregated, and the retrieval precision is obviously improved;
(3) the invention verifies the experimental effect on three data sets, Market1501, DukeMTMC-reID and CUHK03-NP; the results show that, compared with other methods, the proposed re-identification network improves markedly on both the CMC and mAP indexes, especially on the CUHK03-NP data set.
Drawings
FIG. 1 is a schematic diagram of a main workflow of pedestrian re-identification in an exemplary embodiment of the present invention;
FIG. 2 is a schematic diagram of a re-identification network structure including a channel and pixel attention mechanism in an exemplary embodiment of the invention;
FIG. 3 is a block diagram of a channel attention module in accordance with an exemplary embodiment of the present invention;
FIG. 4 is an attention map of a channel attention module in an exemplary embodiment of the invention;
FIG. 5 is a block diagram of a pixel attention module in accordance with an exemplary embodiment of the present invention;
FIG. 6 is an attention map of a pixel attention module in an exemplary embodiment of the invention;
FIG. 7 is a diagram illustrating the retrieval results on the data sets Market1501, DukeMTMC-reID and CUHK03-NP in accordance with an exemplary embodiment of the present invention.
Detailed Description
In view of the deficiencies in the prior art, the inventors of the present invention have made extensive studies and extensive practices to provide technical solutions of the present invention. The technical solution, its implementation and principles, etc. will be further explained as follows.
Referring to fig. 1, in fig. 1, CA represents a channel attention module, PA represents a pixel attention module, and a pedestrian re-identification method based on a pixel and channel attention mechanism includes:
firstly, extracting the global characteristics of the pedestrian according to a bounding box of the person;
then, the pedestrian picture is divided evenly into two parts and into three parts and the pedestrian's local features are extracted from each; in this process, the channel and pixel attention modules are used to aggregate the pedestrian feature information;
and then matching the extracted character features with the character information in the Gallery to find out the required character information.
The extracted global features of the pedestrian mainly include features such as color and edge, and the local features of the pedestrian refer to features such as color and edge of different areas of the pedestrian in the vertical direction.
The character information in the Gallery specifically refers to character feature information output after pictures in the Gallery are input into a trained model. The extracted human features refer to human feature information obtained by feature aggregation through a neural network.
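The matching step against the Gallery can be sketched as a nearest-neighbour search over the feature vectors; the patent does not fix a distance measure, so the cosine similarity and the function name below are illustrative assumptions:

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Rank gallery entries by cosine similarity to the query feature.

    query_feat:    (D,) feature vector of the probe person
    gallery_feats: (N, D) feature vectors of the gallery pictures
    Returns gallery indices, most similar first.
    """
    # L2-normalise so the dot product equals cosine similarity
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                 # one similarity score per gallery entry
    return np.argsort(-sims)     # descending similarity order
```

The first index returned corresponds to the gallery picture deemed most likely to show the queried person.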
The invention extracts global and local features with a neural network. Fig. 2 shows the structure of the re-identification network containing the channel and pixel attention mechanisms; in the overall structure it can be seen that, of the three branches of the main network, the upper branch extracts the person's global features, while the middle and lower branches extract the person's local features.
The specific details of the overall neural network are described below:
(1) Overall network structure, shown in fig. 2, where PA is the pixel attention model, CA is the channel attention model, Triplet_Loss is the triplet loss function, CrossEntropy Loss is the cross-entropy loss function, and Sum_Loss is the total loss function. The network uses ResNet-50 as the base network for extracting picture features. Unlike the base network, we use only the first three layers of ResNet-50, after which the whole network is divided into three branches: in the first branch we extract the global features of the image, the second branch divides the feature tensor into two parts in the vertical direction, and the third branch divides it into three parts in the vertical direction. We then use the channel attention module to aggregate the feature information and remove redundant channel information, use max pooling to reduce the dimensionality, and finally use a 1 × 1 convolutional layer to reduce the feature-vector dimension from 2048 to 256. As also shown in fig. 2, we add middle-layer supervision after layer1, layer2 and layer3, where we use the pixel attention module to reduce the value of background pixels and increase the value of person pixels. The dimensions of the network feature maps are given in Table 1.
Numbering  Module               Feature size  Dimension
1          Layer1               96×32         256
2          Layer2               48×16         512
3          Layer3               24×8          1024
4          Branch_Global        12×4          2048
5          Branch_Part1         24×8          2048
6          Branch_Part2         24×8          2048
7          Channel Attention-1  12×4          2048
8          Channel Attention-2  24×8          2048
9          Channel Attention-3  24×8          2048
10         Pixel Attention-1    96×32         256
11         Pixel Attention-2    48×16         512
12         Pixel Attention-3    24×8          1024

Table 1. Network feature-map information; the resolution of the input picture is set to 384 × 128.
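The three-branch partition of the feature tensor can be sketched as follows (the helper name is an assumption; the shapes follow Table 1, with the height assumed divisible by 6):

```python
import numpy as np

def three_branches(feat):
    """Split an (H, W, C) feature tensor as in Fig. 2: kept whole for the
    global branch, halved vertically for the two-part branch, and cut into
    thirds for the three-part branch."""
    h = feat.shape[0]
    global_branch = feat
    two_parts = [feat[: h // 2], feat[h // 2 :]]
    three_parts = [feat[i * h // 3 : (i + 1) * h // 3] for i in range(3)]
    return global_branch, two_parts, three_parts
```

Each returned part is then processed by its own attention and pooling stage before classification.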
(2) The channel attention module is structured as shown in fig. 3.
Before this work, CNN-based networks gave every channel of a tensor the same weight. Equal weights do not match reality: they make it impossible to delete redundant channel information, so noise enters the final feature vector and degrades the retrieval result. The key to a channel attention mechanism is how to give each channel its own weight; fig. 3 shows the channel attention model structure we designed.
As shown in fig. 3, AvgPool2d is an adaptive pooling layer and Conv2d is a convolutional layer. Let the size of the input tensor be H × W × C, denoted X = [x_1, x_2, …, x_C]. In the first step we reduce the dimension of each channel's feature information; the feature of channel c after dimension reduction is represented by F_c:

F_c = (1/(H·W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j)   (1)

where x_c(i, j) is the value at location (i, j) on channel c. The formula averages the tensor within each channel, achieving feature aggregation.
Then the filter is applied to each channel and redundant information is deleted:

ω_c = f_1(F_c)   (2)

In formula (2), ω_c represents the weight given to each channel, F_c represents the tensor value of channel c, and f_1 represents the filtering operation.
Then the dimension-raising operation is performed:

Z_c = f_2(ω_c)   (3)

In formula (3), ω_c is the weight of each channel, Z_c is the final weight of each channel, and f_2 is the dimension-raising function, shown as a convolution operation in the structure diagram. Finally, the source tensor is weighted:

X_result-c(i, j) = Z_c · x_c(i, j)   (4)
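The four steps of the channel attention module can be sketched in NumPy as a squeeze-and-excitation-style gate; the ReLU and sigmoid non-linearities and the weight shapes are assumptions, since the text only names f1 as a filtering operation and f2 as a dimension-raising convolution:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Channel attention sketch for an (H, W, C) tensor.

    w1: (C, C_r) weights for the filtering / reduction step f1 (assumed)
    w2: (C_r, C) weights for the dimension-raising step f2 (assumed)
    """
    # Step 1, Eq. (1): average each channel to a single value F_c
    f = x.mean(axis=(0, 1))                      # shape (C,)
    # Step 2, Eq. (2): filter channels; ReLU non-linearity assumed
    omega = np.maximum(f @ w1, 0.0)              # shape (C_r,)
    # Step 3, Eq. (3): raise back to C dims; sigmoid keeps weights in (0, 1)
    z = 1.0 / (1.0 + np.exp(-(omega @ w2)))      # shape (C,)
    # Step 4, Eq. (4): re-weight every position of the source tensor
    return x * z                                  # broadcast over H and W
```

Because every output channel is the source channel scaled by its weight Z_c, background-dominated channels can be attenuated toward zero.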
Fig. 4 shows attention maps of the channel attention module, where "Input image" is the model's input image; as the overall network structure diagram shows, we use the channel attention module in the upper, middle and lower branches of the main network. "No-CA1" is the attention feature map without the channel attention model, and "CA1" is the attention feature map after adding it.
The six images on the right of fig. 4 show the model's feature-aggregation effect after the attention module is used; the highlighted regions mark features with an important influence on the retrieval result. We can see that after using CA (the channel attention module), the neural network effectively deletes the background information, the person features are strengthened, and the retrieval result benefits.
(3) The pixel attention module is shown in fig. 5.
In the present invention, we apply the pixel attention module to the middle supervision branches. As with channel attention, let the size of the input tensor be H × W × C, denoted Y = [y_1, y_2, …, y_C]. The specific operation of the first step, which compresses the number of channels to 1 for subsequent processing, is:

D(i, j) = (1/C) · Σ_{c=1}^{C} y_c(i, j)   (5)

The tensor values are then rearranged, as shown in fig. 5:

E_α = g_0(D), α = 3·j + i   (6)

We then screen them, similarly to channel attention:

{I_1, I_2, …, I_C} = g_1({η_1, η_2, …, η_α} · {E_1, E_2, …, E_α})   (7)
{J_1, J_2, …, J_α} = g_2({γ_1, γ_2, …, γ_N} · {I_1, I_2, …, I_N})   (8)

The resulting vector is then restored to the original map size:

K = g_4(J)   (9)

Finally, we assign a weight to each pixel:

Y_result-c(i, j) = K(i, j) · Y(i, j)   (10)
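A minimal sketch of the pixel attention pipeline of Eqs. (5) to (10); the text leaves g0 to g4 and the weights η, γ unspecified, so here the channel compression is a mean, g0/g4 are flatten/reshape, g1/g2 are scalar gatings, and a sigmoid (an added assumption) keeps each pixel weight in (0, 1):

```python
import numpy as np

def pixel_attention(y, eta=1.0, gamma=1.0):
    """Pixel attention sketch for an (H, W, C) tensor; g0..g4 assumed."""
    h, w, c = y.shape
    d = y.mean(axis=2)               # Eq. (5): compress C channels to 1
    e = d.reshape(-1)                # Eq. (6): rearrange the map into a vector
    i_vec = eta * e                  # Eq. (7): screening step g1 (scalar gate)
    j_vec = gamma * i_vec            # Eq. (8): screening step g2 (scalar gate)
    k = j_vec.reshape(h, w)          # Eq. (9): restore the original map size
    k = 1.0 / (1.0 + np.exp(-k))     # squash to (0, 1) (assumption)
    return y * k[:, :, None]         # Eq. (10): weight every pixel
```

Pixels whose weight K(i, j) is small, typically background, contribute less to the supervised middle-layer features.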
Fig. 6 shows attention maps of the pixel attention module. As with channel attention, we use the pixel attention module in the three branches after layer1, layer2 and layer3. "No-PA1" is the attention feature map without the pixel attention model, and "PA1" is the attention feature map after adding it. As is apparent from fig. 6, after the pixel attention module is used, the environmental information is effectively suppressed and the person's feature information is further strengthened, improving the retrieval result.
The invention provides an innovative pedestrian re-identification network based on a channel attention mechanism, a pixel attention mechanism and intermediate layer supervision. The network can effectively delete redundant information in the character bounding box, so that the character information is effectively aggregated, and the retrieval precision is obviously improved.
(4) Technical effects of the invention
The invention verifies the experimental effect mainly on three data sets: Market1501, DukeMTMC-reID and CUHK03-NP. Tables 2-4 give the comparison results on Market1501, DukeMTMC-reID and CUHK03-NP respectively, where RK stands for the re-ranking algorithm.
Table 2. Comparison of results on the data set Market1501; RK stands for the re-ranking algorithm (table provided as an image in the original).
Table 3. Comparison of results on the data set DukeMTMC-reID; RK stands for the re-ranking algorithm (table provided as an image in the original).
Table 4. Comparison of results on the data set CUHK03-NP; RK stands for the re-ranking algorithm (table provided as an image in the original).
Tables 2-4 show that, compared with other methods, the re-identification network of the invention improves markedly on both the CMC and mAP indexes, especially on the CUHK03-NP data set, where the accuracy reaches rank1/mAP = 80.9/78.7 on CUHK03-labeled and rank1/mAP = 78.9/76.4 on CUHK03-detected, far exceeding other re-ID methods.
Table 5 gives the ablation results, testing the backbone, backbone + CA, and backbone + CA + PA network structures on the DukeMTMC-reID and CUHK03 data sets; it can be seen that the CA and PA modules provided by the invention clearly improve the retrieval effect of the original neural network.
Table 5. Ablation test results (table provided as an image in the original).
FIG. 7 shows the retrieval results obtained with the invention on the data sets Market1501, DukeMTMC-reID and CUHK03-NP.
It should be understood that the above-mentioned embodiments are merely illustrative of the technical concepts and features of the present invention, which are intended to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and therefore, the protection scope of the present invention is not limited thereby. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (6)

1. A pedestrian re-identification method based on a pixel and channel attention mechanism is characterized by comprising the following steps:
extracting the global features of the pedestrian according to the person's search box;
dividing the pedestrian picture evenly into two parts and into three parts, and extracting the pedestrian's local features from each;
and matching the extracted person features against the person information in the Gallery to find the required person information.
2. The method of claim 1 for pedestrian re-identification based on a pixel and channel attention mechanism, comprising: the method comprises the steps of extracting global and local features of the pedestrian based on a neural network, wherein the extracted global features of the pedestrian comprise color and edge features, and the extracted local features of the pedestrian comprise color and edge features of different areas of the pedestrian in the vertical direction.
3. The method of claim 2, comprising: the method comprises the steps that a channel attention module and a pixel attention module are used for aggregating character feature information in the process of extracting the local features of pedestrians, the extracted character features are character feature information obtained by feature aggregation through a neural network, and the character information in the Gallery is the character feature information output after pictures in the Gallery are input into a trained model.
4. The pedestrian re-identification method based on the pixel and channel attention mechanism according to claim 3, wherein the extracting global and local features of the pedestrian based on the neural network specifically comprises:
using a ResNet-50 network as a basic network to extract picture characteristics, and using the first three layers of the ResNet-50 network; then dividing the whole network into three branches, extracting global features of the image in the first branch, dividing the feature tensor into two parts in the vertical direction by the second branch, and dividing the feature tensor into three parts in the vertical direction by the third branch; then, the channel attention module is used for aggregating the characteristic information and deleting redundant channel information; then using maximal pooling to reduce dimensions; finally, using 1 × 1 convolutional layer, reducing the dimension of the feature vector from 2048 to 256;
the first three layers of the ResNet network, i.e., layer1, layer2, and layer3, are followed by middle layer supervision, in which pixel attention modules are used to reduce the value of background pixels and increase the value of human pixels.
5. The pedestrian re-identification method based on the pixel and channel attention mechanism as claimed in claim 3, wherein the channel attention module is implemented as follows:
let the size of the input tensor be H × W × C, and be denoted as X ═ X1,x2,…,xc]Wherein H represents the height of the image, W represents the width of the image, and C represents the channel;
the first step is as follows: reducing the dimension of the characteristic information of each channel, and taking the characteristic of each channel after dimension reduction as FcTo carry out the presentation of the contents,
Figure FDA0002030658020000021
wherein x isc(i, j) is the value at position (i, j) on channel c, and the formula averages tensor in each channel, so that the characteristic aggregation effect can be achieved;
the second step is that: filtering each channel by using a filter, and deleting redundant information;
Figure FDA0002030658020000022
wherein, ω iscRepresents the weight given to each channel, FcRepresents the tensor value of the c channel, f1Represents a filtering operation;
the third step: perform the dimension-raising operation;

Z_c = f_2(ω_c)   (3)

where ω_c is the weight of each channel, Z_c is the final weight of each channel, and f_2 is the dimension-raising function, realized as a convolution operation;
the fourth step: weight the source tensor;

X_result-c(i, j) = Z_c · x_c(i, j)   (4).
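The four steps of the channel attention module can be sketched numerically. Here f_1 and f_2 are taken to be small linear maps followed by a ReLU and a sigmoid gate (SE-style); these nonlinearities and the weight shapes are assumptions, since the claim does not fix them:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """x: (C, H, W) input tensor; w1: (C//r, C) channel-reducing map (f1);
    w2: (C, C//r) dimension-raising map (f2)."""
    f = x.mean(axis=(1, 2))                   # Eq. (1): per-channel aggregation F_c
    omega = np.maximum(w1 @ f, 0.0)           # Eq. (2): filtering, ReLU assumed
    z = 1.0 / (1.0 + np.exp(-(w2 @ omega)))   # Eq. (3): raise dimension, sigmoid gate assumed
    return x * z[:, None, None]               # Eq. (4): weight the source tensor
```

With all-zero weights the gate is sigmoid(0) = 0.5 for every channel, so the output is the input scaled by one half, which makes the re-weighting step easy to verify.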
6. The method of claim 3, wherein the pixel attention module is implemented as follows:
let the size of the input tensor be H × W × C, denoted Y = [y_1, y_2, …, y_C], where H is the image height, W is the image width, and C is the number of channels;
the first step: compress the number of channels to 1 according to the following formula (5) for subsequent processing;

D(i, j) = (1/C) · Σ_{c=1}^{C} y_c(i, j)   (5)
the second step: rearrange the tensor values into a vector;

E_α = g_0(D), α = 3·j + i   (6)
the third step: perform screening;

{I_1, I_2, …, I_N} = g_1({η_1, η_2, …, η_α} · {E_1, E_2, …, E_α})   (7)

{J_1, J_2, …, J_α} = g_2({γ_1, γ_2, …, γ_N} · {I_1, I_2, …, I_N})   (8)
the fourth step: restore the obtained vector to the original map size, where the map size is the size of the feature map;

K = g_4(J)   (9)
the fifth step: assign a weight to each pixel;

Y_result-c(i, j) = K(i, j) · y_c(i, j)   (10).
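A numerical sketch of the five steps, with assumed choices for the maps the claim leaves unspecified: channel-mean for the compression, flattening for g_0, elementwise η/γ weighting with a ReLU (g_1) and a sigmoid (g_2), and a reshape for g_4:

```python
import numpy as np

def pixel_attention(y, eta, gamma):
    """y: (C, H, W) input tensor; eta, gamma: (H*W,) per-position weights."""
    c, h, w = y.shape
    d = y.mean(axis=0)                              # Eq. (5): compress channels to 1 (mean assumed)
    e = d.reshape(-1)                               # Eq. (6): rearrange into a vector
    i_vec = np.maximum(eta * e, 0.0)                # Eq. (7): first screening, ReLU assumed
    j_vec = 1.0 / (1.0 + np.exp(-gamma * i_vec))    # Eq. (8): second screening, sigmoid assumed
    k = j_vec.reshape(h, w)                         # Eq. (9): restore to the feature-map size
    return y * k[None, :, :]                        # Eq. (10): weight each pixel
```

With η = 0 the sigmoid gate outputs 0.5 at every position, so the map scales every pixel uniformly; a trained η/γ would instead suppress background pixels and boost pedestrian pixels, as the claim intends.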
CN201910310802.5A 2019-04-17 2019-04-17 Pedestrian re-identification method based on pixel and channel attention mechanism Active CN111832348B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910310802.5A CN111832348B (en) 2019-04-17 2019-04-17 Pedestrian re-identification method based on pixel and channel attention mechanism


Publications (2)

Publication Number Publication Date
CN111832348A true CN111832348A (en) 2020-10-27
CN111832348B CN111832348B (en) 2022-05-06

Family

ID=72914987



Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832672A * 2017-10-12 2018-03-23 Beihang University Pedestrian re-identification method using pose information to design multiple loss functions
CN108510012A * 2018-05-04 2018-09-07 Sichuan University Fast target detection method based on multi-scale feature maps


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIANSHENG GUO et al.: "Deep Network with Spatial and Channel Attention for Person Re-identification", 2018 IEEE Visual Communications and Image Processing *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836646A (en) * 2021-02-05 2021-05-25 华南理工大学 Video pedestrian re-identification method based on channel attention mechanism and application
CN112836646B (en) * 2021-02-05 2023-04-28 华南理工大学 Video pedestrian re-identification method based on channel attention mechanism and application
CN112884680A (en) * 2021-03-26 2021-06-01 南通大学 Single image defogging method using end-to-end neural network


Similar Documents

Publication Publication Date Title
CN109829443B (en) Video behavior identification method based on image enhancement and 3D convolution neural network
CN110348376B (en) Pedestrian real-time detection method based on neural network
CN106682108B (en) Video retrieval method based on multi-mode convolutional neural network
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN108304808A Surveillance video object detection method based on spatio-temporal information and deep networks
CN110378849B (en) Image defogging and rain removing method based on depth residual error network
CN107330390B (en) People counting method based on image analysis and deep learning
CN109685045B (en) Moving target video tracking method and system
CN109886159B (en) Face detection method under non-limited condition
CN111709331B (en) Pedestrian re-recognition method based on multi-granularity information interaction model
CN111832348B (en) Pedestrian re-identification method based on pixel and channel attention mechanism
Ma et al. Image-based air pollution estimation using hybrid convolutional neural network
TW201308254A Motion detection method for complex scenes
CN113792606A (en) Low-cost self-supervision pedestrian re-identification model construction method based on multi-target tracking
CN114627269A Virtual reality security monitoring platform based on deep-learning target detection
CN110866453B (en) Real-time crowd steady state identification method and device based on convolutional neural network
CN112164010A (en) Multi-scale fusion convolution neural network image defogging method
CN111507416A (en) Smoking behavior real-time detection method based on deep learning
CN115171183A (en) Mask face detection method based on improved yolov5
CN105701515A (en) Face super-resolution processing method and system based on double-layer manifold constraint
CN110751667A (en) Method for detecting infrared dim small target under complex background based on human visual system
CN105930789A (en) Human body behavior recognition based on logarithmic Euclidean space BOW (bag of words) model
CN117710888A (en) Method and system for re-identifying blocked pedestrians
CN108597172A Forest fire recognition method, device, electronic equipment and storage medium
CN108764287A Object detection method and system based on deep learning and grouped convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant