CN110659589A - Pedestrian re-identification method, system and device based on attitude and attention mechanism - Google Patents

Pedestrian re-identification method, system and device based on attitude and attention mechanism

Info

Publication number
CN110659589A
CN110659589A
Authority
CN
China
Prior art keywords
pedestrian
feature
attention
image
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910840108.4A
Other languages
Chinese (zh)
Other versions
CN110659589B (en)
Inventor
王坤峰
王飞跃
李雪松
刘雅婷
颜拥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Institute of Automation of Chinese Academy of Science
State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science, State Grid Zhejiang Electric Power Co Ltd filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201910840108.4A priority Critical patent/CN110659589B/en
Publication of CN110659589A publication Critical patent/CN110659589A/en
Application granted granted Critical
Publication of CN110659589B publication Critical patent/CN110659589B/en
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image recognition, and particularly relates to a pedestrian re-identification method, system and device based on a posture and attention mechanism, aiming at solving the problem that, because of dataset deviation between different tasks, image key point information cannot be accurately acquired and the pedestrian re-identification precision cannot meet the expected requirement. The method comprises the following steps: extracting the posture of the pedestrian and generating pedestrian key points; deleting redundant background information and correcting the pedestrian detection frame; extracting a first feature map and acquiring a hard attention map with a hard attention mechanism module; fusing the first feature map and the hard attention map to obtain a second feature map; acquiring a soft attention map with a soft attention mechanism module and fusing again; and performing global average pooling and feature dimension reduction on the fused third feature map to obtain a feature vector for pedestrian re-identification. The invention combines a hard attention mechanism and a soft attention mechanism, effectively enhances the foreground information of the feature map, suppresses background noise, and improves the accuracy and stability of pedestrian re-identification.

Description

Pedestrian re-identification method, system and device based on attitude and attention mechanism
Technical Field
The invention belongs to the technical field of computer image recognition, and particularly relates to a pedestrian re-recognition method, system and device based on a posture and attention mechanism.
Background
Pedestrian re-identification is a technology that uses computer vision to find the same target under different cameras. It is regarded as a sub-problem of image retrieval, is widely applied in fields such as intelligent video surveillance and intelligent security, and is an indispensable part of building smart cities.
Pedestrian re-identification technology has received increasing attention. With the development of computer vision theory and the support of hardware systems, it has advanced considerably. Early pedestrian re-identification techniques relied on traditional methods with hand-designed features, but could only be applied to specific scenes; the feature representation capability was insufficient and the model generalization capability was weak. With the development of deep learning, a large number of deep learning techniques have been applied to the pedestrian re-identification task, mainly divided into two classes of methods: feature-based learning and distance-metric-based learning. Although recognition accuracy has improved greatly, some drawbacks remain. The main problems faced in pedestrian re-identification are viewing angle changes, pedestrian mismatching due to inaccurate detection, occlusion, similar appearance, and so on. Although some methods use posture information or attention mechanisms to address these problems, the attitude estimation network is trained on a pose estimation dataset that deviates to some extent from the pedestrian re-identification dataset; for example, pedestrian key points cannot be accurately acquired on some images, and this deviation may degrade pedestrian re-identification performance.
In general, under the condition that data set deviations exist among different tasks, the prior art cannot accurately acquire the key point information of the image, and the re-identification precision of pedestrians cannot meet the expected requirement.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, the prior art cannot accurately acquire the image key point information and the accuracy of pedestrian re-identification cannot meet the expected requirements under the condition that data set deviations exist among different tasks, the invention provides a pedestrian re-identification method based on an attitude and attention mechanism, which comprises the following steps:
step S10, acquiring a pedestrian image to be recognized as a first image;
step S20, extracting pedestrian attitude information of the first image by adopting an attitude estimation network, and generating pedestrian key points;
step S30, based on the pedestrian key points, deleting the redundant background information of the first image, and correcting a pedestrian detection frame to obtain a second image;
step S40, generating a feature map from the second image through a feature extraction network to obtain a first feature map; and generating a hard attention map with the same size as the feature map by applying Gaussian mapping, binarization and normalization to the pedestrian key points;
step S50, fusing the first feature map and the hard attention map to obtain a second feature map;
step S60, acquiring a soft attention map with the same size as the second feature map through a soft attention network, and fusing the soft attention map with the second feature map to obtain a third feature map;
and step S70, performing global average pooling and feature dimension reduction on the third feature map to obtain a feature vector for calculating similarity to realize pedestrian matching, namely the feature vector for re-identifying pedestrians.
In some preferred embodiments, the redundant background information of the first image is:
the regions above, below, to the left and to the right of the pedestrian in the first image.
In some preferred embodiments, in step S50, "fuse the first feature map and the hard attention map to obtain a second feature map", the method includes:
F2 = (F1 ⊗ Mask_h) ⊕ F1

wherein F1 is the first feature map, F2 is the second feature map, Mask_h is the hard attention map, and ⊗ and ⊕ denote element-wise multiplication and element-wise addition, respectively.
In some preferred embodiments, in step S60, "obtain a soft attention map with the same size as the second feature map through a soft attention network, and obtain a third feature map by fusing with the second feature map", the method includes:
step S61, obtaining a soft attention map with the same size as the second feature map through a soft attention network:
Mask_s = Sigmoid(BN(Conv(ReLU(Conv(F2)))))

wherein Mask_s represents the soft attention map, F2 is the second feature map, Conv represents a 1 × 1 convolution operation, BN represents batch normalization, and Sigmoid and ReLU represent activation functions;
step S62, fusing the obtained soft attention map and the second feature map to obtain a third feature map:
F3 = (F2 ⊗ Mask_s) ⊕ F2

wherein F2 is the second feature map, F3 is the third feature map, Mask_s is the soft attention map, and ⊗ and ⊕ denote element-wise multiplication and element-wise addition, respectively.
In some preferred embodiments, in the training process of the network model, after "performing global average pooling and feature dimension reduction on the third feature map to obtain a feature vector for calculating similarity to realize pedestrian matching, i.e., a feature vector for pedestrian re-identification" in step S70, a step of supervised training is further provided, in which the method includes:
and performing supervised training on the extracted feature vectors on the acquired data set labeled with the pedestrian category by adopting cross entropy loss and triple loss.
In some preferred embodiments, the cross-entropy loss is:

L_softmax = -(1/N) Σ_{i=1}^{N} log( exp(w_i^T f_i) / Σ_{k=1}^{C} exp(w_k^T f_i) )

wherein L_softmax represents the cross-entropy loss function, w_k represents the weight of the k-th class, w_i is the weight of the class corresponding to the i-th image in one Batchsize, C represents the number of pedestrian classes in the acquired dataset labeled with pedestrian classes, N represents the number of images contained in one Batchsize, and f_i represents the feature vector corresponding to the i-th image in one Batchsize.
In some preferred embodiments, the triplet loss is:

L_triplet = Σ_{a=1}^{P×K} max( 0, α + ||f^a - f^p||_2 - ||f^a - f^n||_2 )

wherein L_triplet represents the triplet loss function, f^a represents the feature vector extracted from a reference (anchor) pedestrian image in the training image set, f^p represents the feature vector extracted from another image of the same person as the reference pedestrian, used as the positive sample, f^n represents the feature vector extracted from an image of a different person, used as the negative sample, α represents the margin of the triplet constraint, P indicates that there are P IDs in one Batchsize, and K indicates that K images are selected for each ID.
In another aspect of the invention, a pedestrian re-identification system based on a posture and attention mechanism is provided, and comprises an image acquisition module, a posture extraction module, a correction module, a hard attention diagram generation module, a soft attention diagram generation module, a fusion module, a feature vector acquisition module and an output module;
the image acquisition module is configured to acquire a pedestrian image to be identified as a first image and input the first image to the attitude extraction module;
the attitude extraction module is configured to extract pedestrian attitude information of the first image sent by the image acquisition module by adopting an attitude estimation network, and generate pedestrian key points;
the correction module is configured to delete the redundant background information of the first image based on the pedestrian key point, and correct a pedestrian detection frame to obtain a second image;
the hard attention map generation module is configured to generate a feature map from the second image through a feature extraction network to obtain a first feature map, and to generate a hard attention map with the same size as the feature map by applying Gaussian mapping, binarization and normalization to the pedestrian key points;
the fusion module is configured to fuse the first feature map and the hard attention map to obtain a second feature map;
the soft attention map generation module is configured to acquire a soft attention map with the same size as the second feature map through a soft attention network, and fuse the soft attention map and the second feature map by using the fusion module to obtain a third feature map;
the feature vector acquisition module is configured to perform global average pooling and feature dimension reduction on the third feature map to obtain feature vectors for pedestrian re-identification;
the output module is configured to output the obtained feature vectors for calculating the similarity to realize pedestrian matching, namely the feature vectors for re-identifying the pedestrians.
In a third aspect of the invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-described pedestrian re-identification method based on the attitude and attention mechanism.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; the processor is suitable for executing various programs; the storage device is suitable for storing a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described pedestrian re-identification method based on the attitude and attention mechanism.
The invention has the beneficial effects that:
(1) The pedestrian re-identification method based on the posture and attention mechanism combines a hard attention mechanism and a soft attention mechanism to fuse the features extracted from the image. This alleviates the inaccurate extraction of image key point information caused by dataset deviation between different tasks, effectively enhances the foreground information of the feature map, suppresses background noise, and enhances the discriminability and robustness of the extracted features, thereby improving the accuracy and stability of pedestrian re-identification.
(2) The pedestrian re-identification method based on the posture and attention mechanism trains the acquired feature vectors for pedestrian re-identification under the supervision of cross-entropy loss and triplet loss, which encourages smaller intra-class distances and larger inter-class distances and improves the robustness of pedestrian re-identification.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a schematic flow diagram of a pedestrian re-identification method based on an attitude and attention mechanism in accordance with the present invention;
FIG. 2 is a schematic diagram of a hard attention mechanism and a visualization effect diagram of an embodiment of a pedestrian re-identification method based on attitude and attention mechanisms according to the invention;
FIG. 3 is a schematic diagram of a soft attention mechanism and a visualization effect diagram of an embodiment of a pedestrian re-identification method based on attitude and attention mechanisms according to the invention;
FIG. 4 is a network diagram of a combination of a hard attention mechanism and a soft attention mechanism according to an embodiment of the pedestrian re-identification method based on attitude and attention mechanisms.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The invention discloses a pedestrian re-identification method based on an attitude and attention mechanism, which comprises the following steps of:
step S10, acquiring a pedestrian image to be recognized as a first image;
step S20, extracting pedestrian attitude information of the first image by adopting an attitude estimation network, and generating pedestrian key points;
step S30, based on the pedestrian key points, deleting the redundant background information of the first image, and correcting a pedestrian detection frame to obtain a second image;
step S40, generating a feature map from the second image through a feature extraction network to obtain a first feature map; and generating a hard attention map with the same size as the feature map by applying Gaussian mapping, binarization and normalization to the pedestrian key points;
step S50, fusing the first feature map and the hard attention map to obtain a second feature map;
step S60, acquiring a soft attention map with the same size as the second feature map through a soft attention network, and fusing the soft attention map with the second feature map to obtain a third feature map;
and step S70, performing global average pooling and feature dimension reduction on the third feature map to obtain a feature vector for calculating similarity to realize pedestrian matching, namely the feature vector for re-identifying pedestrians.
In order to more clearly explain the pedestrian re-identification method based on the posture and attention mechanism of the present invention, the following describes the steps in the embodiment of the method of the present invention in detail with reference to fig. 1.
The pedestrian re-identification method based on the attitude and attention mechanism comprises the steps of S10-S70, wherein the steps are described in detail as follows:
in step S10, an image of a pedestrian to be recognized is acquired as a first image.
Common pedestrian re-identification datasets include DukeMTMC-reID, Market-1501, CUHK03, MSMT17, LPW, etc.
In one embodiment of the invention, two pedestrian datasets, Market-1501 and DukeMTMC-reID, are selected as the source of pedestrian images to be identified.
And step S20, extracting pedestrian attitude information of the first image by adopting an attitude estimation network, and generating pedestrian key points.
Posture information may be extracted using a pose estimation network such as AlphaPose or OpenPose pre-trained on the COCO dataset. In one embodiment of the invention, the attitude estimation network AlphaPose, trained in advance on the COCO dataset, is used to extract posture information and generate pedestrian key points.
The pedestrian key points include:
nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left waist, right waist, left knee, right knee, left ankle, right ankle.
And step S30, based on the pedestrian key points, deleting the redundant background information of the first image, and correcting a pedestrian detection frame to obtain a second image.
The redundant background information of the first image is:
four areas of the pedestrian in the first image are up, down, left and right.
Image preprocessing is then carried out: redundant background information is removed and the pedestrian detection frame is corrected, which facilitates pedestrian alignment and further improves the identification precision. In one embodiment of the invention, the processed second image has a size of 384 × 128 × 3.
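As an illustrative sketch of this preprocessing step (not the patent's reference implementation), the following Python snippet crops the detection box to the extent of the detected key points plus a small margin and resizes the crop to 384 × 128; the margin value and the helper name are assumptions introduced here for illustration.

```python
import numpy as np
import cv2  # OpenCV is used here for cropping and resizing

def correct_detection_box(image, keypoints, margin=0.1, out_hw=(384, 128)):
    """Remove redundant background and correct the pedestrian detection box.

    image:     H x W x 3 uint8 array (the first image).
    keypoints: N x 2 array of (x, y) pedestrian key points in image coordinates.
    margin:    fraction of the key-point extent kept around the body (assumed value).
    Returns the corrected second image of size out_hw = (height, width).
    """
    keypoints = np.asarray(keypoints, dtype=np.float32)
    h, w = image.shape[:2]
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    dx, dy = (x_max - x_min) * margin, (y_max - y_min) * margin
    # Clip the enlarged box to the image borders.
    x0, x1 = int(max(0, x_min - dx)), int(min(w, x_max + dx))
    y0, y1 = int(max(0, y_min - dy)), int(min(h, y_max + dy))
    crop = image[y0:y1, x0:x1]
    return cv2.resize(crop, (out_hw[1], out_hw[0]))  # cv2.resize expects (width, height)
```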
Step S40, generating a feature map from the second image through a feature extraction network to obtain a first feature map; and generating a hard attention diagram with the same size as the feature diagram by Gaussian transformation, binarization and normalization of the pedestrian key points.
In one embodiment of the invention, the preprocessed 384 × 128 × 3 second image is input into the feature extraction network ResNet-50 to generate a 2048 × 24 × 8 first feature map, i.e., a feature map with a spatial size of 24 × 8 and 2048 channels. A Gaussian map is generated centered at each of the 17 pedestrian key points, giving 17 Gaussian maps, with the standard deviation σ of the Gaussian set to 16. With a threshold of 0.8, binarization yields 17 binary maps. The 17 binary maps are summed and normalized to produce a 24 × 8 hard attention map of the same size as the first feature map.
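The following Python sketch illustrates one possible implementation of this hard attention map construction with NumPy. It assumes the key-point coordinates and σ are expressed on the same grid as the feature map and that the final normalization divides by the maximum value; both are assumptions, as the patent does not fix these details.

```python
import numpy as np

def hard_attention_map(keypoints, hw=(24, 8), sigma=16, thresh=0.8):
    """Build a hard attention map from pedestrian key points.

    keypoints: iterable of (x, y) coordinates scaled to the feature-map grid.
    hw:        (height, width) of the feature map, here 24 x 8.
    sigma:     standard deviation of the Gaussian placed at each key point.
    thresh:    binarization threshold applied to each Gaussian map.
    """
    h, w = hw
    ys, xs = np.mgrid[0:h, 0:w]
    mask = np.zeros((h, w), dtype=np.float32)
    for kx, ky in keypoints:
        g = np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2.0 * sigma ** 2))
        mask += (g >= thresh).astype(np.float32)  # binarize each Gaussian map, then sum
    if mask.max() > 0:
        mask /= mask.max()  # normalize the summed binary maps
    return mask
```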
As shown in fig. 2, which is a schematic diagram and a visualization effect diagram of the hard attention mechanism according to an embodiment of the pedestrian re-identification method based on the posture and attention mechanism of the present invention, the attitude estimation network is used to extract the pedestrian posture from the input image and generate the pedestrian key points, and the pedestrian key points are Gaussian-mapped, binarized and normalized to obtain the hard attention map; the white regions in the middle of the figure are the Gaussian binary regions generated by the pedestrian key points, that is, the hard attention regions.
Step S50, fusing the first feature map and the hard attention map to obtain a second feature map, as shown in formula (1):
F2 = (F1 ⊗ Mask_h) ⊕ F1    (1)

wherein F1 is the first feature map, F2 is the second feature map, Mask_h is the hard attention map, and ⊗ and ⊕ denote element-wise multiplication and element-wise addition, respectively.
In one embodiment of the invention, a first feature map generated based on a feature extraction network and a hard attention map are fused to generate a second feature map, and the size of the second feature map is 2048 × 24 × 8.
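Formula (1) amounts to simple tensor broadcasting; the following PyTorch-style sketch (an illustration, not the patent's reference implementation) applies a 24 × 8 hard attention map to a 2048 × 24 × 8 feature map.

```python
import torch

def fuse_hard_attention(f1: torch.Tensor, mask_h: torch.Tensor) -> torch.Tensor:
    """Fuse the first feature map with the hard attention map, as in formula (1).

    f1:     feature map of shape (N, 2048, 24, 8).
    mask_h: hard attention map of shape (24, 8), broadcast over batch and channels.
    """
    mask_h = mask_h.to(f1.dtype).unsqueeze(0).unsqueeze(0)  # -> (1, 1, 24, 8)
    return f1 * mask_h + f1  # element-wise multiplication, then element-wise addition
```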
And step S60, acquiring a soft attention map with the same size as the second feature map through a soft attention network, and fusing the soft attention map with the second feature map to obtain a third feature map.
Step S61, obtaining a soft attention map with the same size as the second feature map through a soft attention network, as shown in formula (2):
Mask_s = Sigmoid(BN(Conv(ReLU(Conv(F2)))))    (2)

wherein Mask_s represents the soft attention map, F2 is the second feature map, Conv represents a 1 × 1 convolution operation, BN represents batch normalization, and Sigmoid and ReLU represent activation functions.
Step S62, fusing the obtained soft attention map with the second feature map to obtain a third feature map, as shown in equation (3):
F3 = (F2 ⊗ Mask_s) ⊕ F2    (3)

wherein F2 is the second feature map, F3 is the third feature map, Mask_s is the soft attention map, and ⊗ and ⊕ denote element-wise multiplication and element-wise addition, respectively.
In an embodiment of the present invention, a third feature map is generated by fusing the fused second feature map and the soft attention map, and the size of the third feature map is 2048 × 24 × 8.
As shown in fig. 3, which is a schematic diagram and a visualization effect diagram of the soft attention mechanism of an embodiment of the pedestrian re-identification method based on the posture and attention mechanism of the present invention, a feature map is extracted from the input image through the convolutional neural network, and the soft attention network applies a sequence of operations to the feature map, namely convolution, ReLU activation, convolution, batch normalization and Sigmoid activation, to obtain the soft attention map. Here Conv denotes the convolution operation, ReLU the ReLU activation, BN the batch normalization operation, and Sigmoid the Sigmoid activation.
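Assuming Conv is a 1 × 1 convolution, the soft attention branch of formula (2) and the fusion of equation (3) can be sketched as the following PyTorch module. The hidden channel width and the use of a single-channel mask broadcast across channels are assumptions; the patent only states that the soft attention map has the same size as the second feature map.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Soft attention branch: Conv 1x1 -> ReLU -> Conv 1x1 -> BN -> Sigmoid, as in formula (2)."""

    def __init__(self, channels: int = 2048, hidden: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(hidden, 1, kernel_size=1)
        self.bn = nn.BatchNorm2d(1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f2: torch.Tensor) -> torch.Tensor:
        mask_s = self.sigmoid(self.bn(self.conv2(self.relu(self.conv1(f2)))))
        return f2 * mask_s + f2  # equation (3): element-wise multiplication, then addition
```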
And step S70, performing global average pooling and feature dimension reduction on the third feature map to obtain a feature vector for calculating similarity to realize pedestrian matching, namely the feature vector for re-identifying pedestrians.
In one embodiment of the invention, the third feature map obtained from the final fusion is subjected to global average pooling to obtain a 2048-dimensional feature vector; the kernel size of the global average pooling equals the height and width of the feature map. The 2048-dimensional feature vector is then reduced to a 256-dimensional feature vector, where the feature dimension reduction uses a 1 × 1 convolution operation, followed by a Batch Normalization layer for feature normalization and a ReLU activation function for nonlinear mapping of the feature map.
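A minimal PyTorch sketch of this embedding head, assuming the 1 × 1 convolution, Batch Normalization and ReLU are applied before the spatial dimensions are squeezed out:

```python
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Global average pooling followed by 1x1-convolution dimension reduction to 256-D."""

    def __init__(self, in_channels: int = 2048, out_dim: int = 256):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)  # pooling kernel spans the full 24 x 8 map
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, out_dim, kernel_size=1),
            nn.BatchNorm2d(out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, f3: torch.Tensor) -> torch.Tensor:
        g1 = self.gap(f3)      # (N, 2048, 1, 1) pooled feature
        g2 = self.reduce(g1)   # (N, 256, 1, 1) reduced feature
        return g2.flatten(1)   # (N, 256) feature vector for pedestrian re-identification
```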
FIG. 4 is a schematic diagram of the network combining the hard attention mechanism and the soft attention mechanism according to an embodiment of the pedestrian re-identification method based on the posture and attention mechanism of the present invention, wherein F1, F2 and F3 are the first, second and third feature maps respectively, and G1 and G2 represent the feature vector after global average pooling and the feature vector after feature dimension reduction, respectively. The attitude estimation network is trained offline on a pose estimation dataset; in the schematic diagram it extracts the posture information of the input pedestrian image to obtain the pedestrian key points. GAP represents global average pooling, and ⊗ and ⊕ denote element-wise multiplication and element-wise addition, respectively.
In the training process of the network model, after "performing global average pooling and feature dimension reduction on the third feature map to obtain a feature vector for calculating similarity to realize pedestrian matching, i.e., a feature vector for pedestrian re-recognition" in step S70, a step of supervised training is further provided, in which the method includes:
and performing supervised training on the extracted feature vectors on the acquired data set labeled with the pedestrian category by adopting cross entropy loss and triple loss.
In one embodiment of the invention, a single GPU (Nvidia 1080p) is used during training, the Batchsize is set to 32, the optimizer is Adam, the number of training epochs is set to 500, and the initial learning rate is set to 2e-4; the learning rate decreases continuously as the number of training epochs increases, and the accuracy rises.
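As a hedged illustration of this training configuration (the exact learning-rate schedule and the model interface are not specified in the patent; the step decay and the assumption that the model returns both feature vectors and classification logits are introduced here for illustration):

```python
import torch

# model, train_loader, cross_entropy_loss and triplet_loss are assumed to be defined elsewhere.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=150, gamma=0.1)  # assumed decay

for epoch in range(500):                      # 500 training epochs
    for images, pids in train_loader:         # Batchsize = 32 images with identity labels
        features, logits = model(images)      # 256-D features and identity logits (assumed interface)
        loss = cross_entropy_loss(logits, pids) + triplet_loss(features, pids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                          # learning rate decreases as training progresses
```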
The feature vectors are used to represent pedestrians. The same pedestrian may appear in multiple pictures; the feature vectors extracted from pictures of the same pedestrian should be close in the vector space, while feature vectors of different identities should be as far apart as possible. Different pictures of the same pedestrian share the same pedestrian ID.
Calculating the cross entropy of the pedestrian category predicted by the extracted feature vector and the pedestrian category label corresponding to the feature vector, wherein the formula (4) is as follows:
L_softmax = -(1/N) Σ_{i=1}^{N} log( exp(w_i^T f_i) / Σ_{k=1}^{C} exp(w_k^T f_i) )    (4)

wherein L_softmax represents the cross-entropy loss function, w_k represents the weight of the k-th class, w_i is the weight of the class corresponding to the i-th image in one Batchsize, C represents the number of pedestrian classes in the acquired dataset labeled with pedestrian classes, N represents the number of images contained in one Batchsize, and f_i represents the feature vector corresponding to the i-th image in one Batchsize.
The triple loss is shown in equation (5):
L_triplet = Σ_{a=1}^{P×K} max( 0, α + ||f^a - f^p||_2 - ||f^a - f^n||_2 )    (5)

wherein L_triplet represents the triplet loss function, f^a represents the feature vector extracted from a reference (anchor) pedestrian image in the training image set, f^p represents the feature vector extracted from another image of the same person as the reference pedestrian, used as the positive sample, f^n represents the feature vector extracted from an image of a different person, used as the negative sample, α represents the margin of the triplet constraint, P indicates that there are P IDs in one Batchsize, and K indicates that K images are selected for each ID.
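A minimal sketch of formula (5), using batch-hard mining within a P × K batch; the mining strategy and the margin value are assumptions, since the patent does not state how the positive and negative samples are selected:

```python
import torch

def triplet_loss(features: torch.Tensor, pids: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """Formula (5) with batch-hard mining over a P x K batch (averaged over anchors)."""
    dist = torch.cdist(features, features, p=2)        # pairwise Euclidean distances
    same_id = pids.unsqueeze(0) == pids.unsqueeze(1)   # (N, N) mask of same-identity pairs
    # Hardest positive: farthest image with the same ID; hardest negative: closest other ID.
    d_pos = (dist * same_id.float()).max(dim=1).values
    d_neg = dist.masked_fill(same_id, float('inf')).min(dim=1).values
    return torch.clamp(margin + d_pos - d_neg, min=0).mean()
```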
In the model testing stage, the obtained 256-dimensional feature vectors are extracted for the pedestrian images in the query set and the gallery set, cosine similarity or Euclidean similarity is computed directly between them, and matching and ranking are carried out according to the similarity. Pedestrian images with high similarity are more likely to be the same target, while those with low similarity are less likely, thereby realizing pedestrian re-identification.
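A sketch of this matching step using cosine similarity (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_feats: torch.Tensor, gallery_feats: torch.Tensor) -> torch.Tensor:
    """Return, for each query image, gallery indices sorted from most to least similar.

    query_feats:   (Q, 256) feature vectors of the query images.
    gallery_feats: (G, 256) feature vectors of the gallery images.
    """
    q = F.normalize(query_feats, dim=1)
    g = F.normalize(gallery_feats, dim=1)
    similarity = q @ g.t()                             # (Q, G) cosine similarities
    return similarity.argsort(dim=1, descending=True)  # ranked gallery indices per query
```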
The pedestrian re-recognition system based on the attitude and attention mechanism comprises an image acquisition module, an attitude extraction module, a correction module, a hard attention diagram generation module, a soft attention diagram generation module, a fusion module, a feature vector acquisition module and an output module;
the image acquisition module is configured to acquire a pedestrian image to be identified as a first image and input the first image to the attitude extraction module;
the attitude extraction module is configured to extract pedestrian attitude information of the first image sent by the image acquisition module by adopting an attitude estimation network, and generate pedestrian key points;
the correction module is configured to delete the redundant background information of the first image based on the pedestrian key point, and correct a pedestrian detection frame to obtain a second image;
the hard attention map generation module is configured to generate a feature map from the second image through a feature extraction network to obtain a first feature map, and to generate a hard attention map with the same size as the feature map by applying Gaussian mapping, binarization and normalization to the pedestrian key points;
the fusion module is configured to fuse the first feature map and the hard attention map to obtain a second feature map;
the soft attention map generation module is configured to acquire a soft attention map with the same size as the second feature map through a soft attention network, and fuse the soft attention map and the second feature map by using the fusion module to obtain a third feature map;
the feature vector acquisition module is configured to perform global average pooling and feature dimension reduction on the third feature map to obtain feature vectors for pedestrian re-identification;
the output module is configured to output the obtained feature vectors for calculating the similarity to realize pedestrian matching, namely the feature vectors for re-identifying the pedestrians.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that, the pedestrian re-identification system based on the gesture and attention mechanism provided in the above embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the above embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the above described functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device of a third embodiment of the present invention has stored therein a plurality of programs adapted to be loaded and executed by a processor to implement the above-described pedestrian re-identification method based on the attitude and attention mechanism.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described pedestrian re-identification method based on the attitude and attention mechanism.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art will appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be located in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A pedestrian re-identification method based on attitude and attention mechanisms is characterized by comprising the following steps:
step S10, acquiring a pedestrian image to be recognized as a first image;
step S20, extracting pedestrian attitude information of the first image by adopting an attitude estimation network, and generating pedestrian key points;
step S30, based on the pedestrian key points, deleting the redundant background information of the first image, and correcting a pedestrian detection frame to obtain a second image;
step S40, generating a feature map from the second image through a feature extraction network to obtain a first feature map; and generating a hard attention map with the same size as the feature map by applying Gaussian mapping, binarization and normalization to the pedestrian key points;
step S50, fusing the first feature map and the hard attention map to obtain a second feature map;
step S60, acquiring a soft attention map with the same size as the second feature map through a soft attention network, and fusing the soft attention map with the second feature map to obtain a third feature map;
and step S70, performing global average pooling and feature dimension reduction on the third feature map to obtain a feature vector for calculating similarity to realize pedestrian matching, namely the feature vector for re-identifying pedestrians.
2. The pedestrian re-identification method based on the attitude and attention mechanism according to claim 1, wherein the redundant background information of the first image is:
the regions above, below, to the left and to the right of the pedestrian in the first image.
3. The pedestrian re-identification method based on the attitude and attention mechanism according to claim 1, wherein in step S50, "fuse the first feature map and the hard attention map to obtain a second feature map", the method comprises:
F2 = (F1 ⊗ Mask_h) ⊕ F1

wherein F1 is the first feature map, F2 is the second feature map, Mask_h is the hard attention map, and ⊗ and ⊕ denote element-wise multiplication and element-wise addition, respectively.
4. The pedestrian re-identification method based on the posture and attention mechanism according to claim 1, wherein in step S60, "obtaining the soft attention map with the same size as the second feature map through the soft attention network and fusing the soft attention map with the second feature map to obtain a third feature map" is performed by:
step S61, obtaining a soft attention map with the same size as the second feature map through a soft attention network:
Mask_s = Sigmoid(BN(Conv(ReLU(Conv(F2)))))

wherein Mask_s represents the soft attention map, F2 is the second feature map, Conv represents a 1 × 1 convolution operation, BN represents batch normalization, and Sigmoid and ReLU represent activation functions;
step S62, fusing the obtained soft attention map and the second feature map to obtain a third feature map:
F3 = (F2 ⊗ Mask_s) ⊕ F2

wherein F2 is the second feature map, F3 is the third feature map, Mask_s is the soft attention map, and ⊗ and ⊕ denote element-wise multiplication and element-wise addition, respectively.
5. The pedestrian re-identification method based on the attitude and attention mechanism according to claim 1, wherein in step S70, after "global average pooling and feature dimensionality reduction on the third feature map to obtain feature vectors for pedestrian re-identification", there is further provided a step of enhancing identification, and the method comprises:
and performing supervised training on the extracted feature vectors on the acquired data set labeled with the pedestrian category by adopting cross entropy loss and triple loss.
6. The pedestrian re-identification method based on the attitude and attention mechanism according to claim 5, wherein the cross-entropy loss is:

L_softmax = -(1/N) Σ_{i=1}^{N} log( exp(w_i^T f_i) / Σ_{k=1}^{C} exp(w_k^T f_i) )

wherein L_softmax represents the cross-entropy loss function, w_k represents the weight of the k-th class, w_i is the weight of the class corresponding to the i-th image in one Batchsize, C represents the number of pedestrian classes in the acquired dataset labeled with pedestrian classes, N represents the number of images contained in one Batchsize, and f_i represents the feature vector corresponding to the i-th image in one Batchsize.
7. The pedestrian re-identification method based on the attitude and attention mechanism according to claim 1, wherein the triplet loss is:

L_triplet = Σ_{a=1}^{P×K} max( 0, α + ||f^a - f^p||_2 - ||f^a - f^n||_2 )

wherein L_triplet represents the triplet loss function, f^a represents the feature vector extracted from a reference (anchor) pedestrian image in the training image set, f^p represents the feature vector extracted from another image of the same person as the reference pedestrian, used as the positive sample, f^n represents the feature vector extracted from an image of a different person, used as the negative sample, α represents the margin of the triplet constraint, P indicates that there are P IDs in one Batchsize, and K indicates that K images are selected for each ID.
8. A pedestrian re-recognition system based on a posture and attention mechanism is characterized by comprising an image acquisition module, a posture extraction module, a correction module, a hard attention diagram generation module, a soft attention diagram generation module, a fusion module, a feature vector acquisition module and an output module;
the image acquisition module is configured to acquire a pedestrian image to be identified as a first image and input the first image to the attitude extraction module;
the attitude extraction module is configured to extract pedestrian attitude information of the first image sent by the image acquisition module by adopting an attitude estimation network, and generate pedestrian key points;
the correction module is configured to delete the redundant background information of the first image based on the pedestrian key point, and correct a pedestrian detection frame to obtain a second image;
the hard attention map generation module is configured to generate a feature map from the second image through a feature extraction network to obtain a first feature map, and to generate a hard attention map with the same size as the feature map by applying Gaussian mapping, binarization and normalization to the pedestrian key points;
the fusion module is configured to fuse the first feature map and the hard attention map to obtain a second feature map;
the soft attention map generation module is configured to acquire a soft attention map with the same size as the second feature map through a soft attention network, and fuse the soft attention map and the second feature map by using the fusion module to obtain a third feature map;
the feature vector acquisition module is configured to perform global average pooling and feature dimension reduction on the third feature map to obtain feature vectors for pedestrian re-identification;
the output module is configured to output the obtained feature vectors for calculating the similarity to realize pedestrian matching, namely the feature vectors for re-identifying the pedestrians.
9. A storage device having stored thereon a plurality of programs, wherein the programs are adapted to be loaded and executed by a processor to implement the method of pedestrian re-identification based on the gesture and attention mechanism of any one of claims 1 to 7.
10. A processing apparatus, comprising
A processor adapted to execute various programs; and
a storage device adapted to store a plurality of programs;
wherein the program is adapted to be loaded and executed by a processor to perform:
a pedestrian re-identification method based on attitude and attention mechanisms as claimed in any one of claims 1 to 7.
CN201910840108.4A 2019-09-06 2019-09-06 Pedestrian re-identification method, system and device based on attitude and attention mechanism Expired - Fee Related CN110659589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910840108.4A CN110659589B (en) 2019-09-06 2019-09-06 Pedestrian re-identification method, system and device based on attitude and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910840108.4A CN110659589B (en) 2019-09-06 2019-09-06 Pedestrian re-identification method, system and device based on attitude and attention mechanism

Publications (2)

Publication Number Publication Date
CN110659589A true CN110659589A (en) 2020-01-07
CN110659589B CN110659589B (en) 2022-02-08

Family

ID=69038056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910840108.4A Expired - Fee Related CN110659589B (en) 2019-09-06 2019-09-06 Pedestrian re-identification method, system and device based on attitude and attention mechanism

Country Status (1)

Country Link
CN (1) CN110659589B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401265A (en) * 2020-03-19 2020-07-10 重庆紫光华山智安科技有限公司 Pedestrian re-identification method and device, electronic equipment and computer-readable storage medium
CN111428675A (en) * 2020-04-02 2020-07-17 南开大学 Pedestrian re-recognition method integrated with pedestrian posture features
CN111488797A (en) * 2020-03-11 2020-08-04 北京交通大学 Pedestrian re-identification method
CN111652035A (en) * 2020-03-30 2020-09-11 武汉大学 Pedestrian re-identification method and system based on ST-SSCA-Net
CN111898431A (en) * 2020-06-24 2020-11-06 南京邮电大学 Pedestrian re-identification method based on attention mechanism part shielding
CN112463924A (en) * 2020-11-27 2021-03-09 齐鲁工业大学 Text intention matching method for intelligent question answering based on internal correlation coding
CN112528059A (en) * 2021-02-08 2021-03-19 南京理工大学 Deep learning-based traffic target image retrieval method and device and readable medium
CN112800967A (en) * 2021-01-29 2021-05-14 重庆邮电大学 Posture-driven shielded pedestrian re-recognition method
WO2022160772A1 (en) * 2021-01-27 2022-08-04 武汉大学 Person re-identification method based on view angle guidance multi-adversarial attention
US20230098817A1 (en) * 2021-09-27 2023-03-30 Uif (University Industry Foundation), Yonsei University Weakly supervised object localization apparatus and method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107484017A (en) * 2017-07-25 2017-12-15 天津大学 Supervision video abstraction generating method is had based on attention model
US20180114435A1 (en) * 2016-10-26 2018-04-26 Microsoft Technology Licensing, Llc Pedestrian alerts for mobile devices
CN108345837A (en) * 2018-01-17 2018-07-31 浙江大学 A kind of pedestrian's recognition methods again based on the study of human region alignmentization feature representation
CN108520226A (en) * 2018-04-03 2018-09-11 东北大学 A kind of pedestrian's recognition methods again decomposed based on body and conspicuousness detects
CN108764308A (en) * 2018-05-16 2018-11-06 中国人民解放军陆军工程大学 A kind of recognition methods again of the pedestrian based on convolution loop network
CN108805078A (en) * 2018-06-11 2018-11-13 山东大学 Video pedestrian based on pedestrian's average state recognition methods and system again
CN108829677A (en) * 2018-06-05 2018-11-16 大连理工大学 A kind of image header automatic generation method based on multi-modal attention
CN109598225A (en) * 2018-11-29 2019-04-09 浙江大学 Sharp attention network, neural network and pedestrian's recognition methods again

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180114435A1 (en) * 2016-10-26 2018-04-26 Microsoft Technology Licensing, Llc Pedestrian alerts for mobile devices
CN107484017A (en) * 2017-07-25 2017-12-15 天津大学 Supervision video abstraction generating method is had based on attention model
CN108345837A (en) * 2018-01-17 2018-07-31 浙江大学 A kind of pedestrian's recognition methods again based on the study of human region alignmentization feature representation
CN108520226A (en) * 2018-04-03 2018-09-11 东北大学 A kind of pedestrian's recognition methods again decomposed based on body and conspicuousness detects
CN108764308A (en) * 2018-05-16 2018-11-06 中国人民解放军陆军工程大学 A kind of recognition methods again of the pedestrian based on convolution loop network
CN108829677A (en) * 2018-06-05 2018-11-16 大连理工大学 A kind of image header automatic generation method based on multi-modal attention
CN108805078A (en) * 2018-06-11 2018-11-13 山东大学 Video pedestrian based on pedestrian's average state recognition methods and system again
CN109598225A (en) * 2018-11-29 2019-04-09 浙江大学 Sharp attention network, neural network and pedestrian's recognition methods again

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐阳: "Research on pedestrian re-identification algorithms based on convolutional neural networks", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488797A (en) * 2020-03-11 2020-08-04 北京交通大学 Pedestrian re-identification method
CN111488797B (en) * 2020-03-11 2023-12-05 北京交通大学 Pedestrian re-identification method
CN111401265B (en) * 2020-03-19 2020-12-25 重庆紫光华山智安科技有限公司 Pedestrian re-identification method and device, electronic equipment and computer-readable storage medium
CN111401265A (en) * 2020-03-19 2020-07-10 重庆紫光华山智安科技有限公司 Pedestrian re-identification method and device, electronic equipment and computer-readable storage medium
CN111652035A (en) * 2020-03-30 2020-09-11 武汉大学 Pedestrian re-identification method and system based on ST-SSCA-Net
CN111428675A (en) * 2020-04-02 2020-07-17 南开大学 Pedestrian re-recognition method integrated with pedestrian posture features
CN111898431B (en) * 2020-06-24 2022-07-26 南京邮电大学 Pedestrian re-identification method based on attention mechanism part shielding
CN111898431A (en) * 2020-06-24 2020-11-06 南京邮电大学 Pedestrian re-identification method based on attention mechanism part shielding
CN112463924A (en) * 2020-11-27 2021-03-09 齐鲁工业大学 Text intention matching method for intelligent question answering based on internal correlation coding
WO2022160772A1 (en) * 2021-01-27 2022-08-04 武汉大学 Person re-identification method based on view angle guidance multi-adversarial attention
US11804036B2 (en) 2021-01-27 2023-10-31 Wuhan University Person re-identification method based on perspective-guided multi-adversarial attention
CN112800967B (en) * 2021-01-29 2022-05-17 重庆邮电大学 Posture-driven shielded pedestrian re-recognition method
CN112800967A (en) * 2021-01-29 2021-05-14 重庆邮电大学 Posture-driven shielded pedestrian re-recognition method
CN112528059A (en) * 2021-02-08 2021-03-19 南京理工大学 Deep learning-based traffic target image retrieval method and device and readable medium
US20230098817A1 (en) * 2021-09-27 2023-03-30 Uif (University Industry Foundation), Yonsei University Weakly supervised object localization apparatus and method

Also Published As

Publication number Publication date
CN110659589B (en) 2022-02-08

Similar Documents

Publication Publication Date Title
CN110659589B (en) Pedestrian re-identification method, system and device based on attitude and attention mechanism
Ranjan et al. Unconstrained age estimation with deep convolutional neural networks
Carmona et al. Human action recognition by means of subtensor projections and dense trajectories
CN111310705A (en) Image recognition method and device, computer equipment and storage medium
Ravì et al. Real-time food intake classification and energy expenditure estimation on a mobile device
Khan et al. A review of human pose estimation from single image
Nuevo et al. RSMAT: Robust simultaneous modeling and tracking
US10007678B2 (en) Image processing apparatus, image processing method, and recording medium
Liu et al. Adaptive cascade regression model for robust face alignment
Liu et al. Adaptive compressive tracking via online vector boosting feature selection
CN110909565B (en) Image recognition and pedestrian re-recognition method and device, electronic and storage equipment
Gawande et al. SIRA: Scale illumination rotation affine invariant mask R-CNN for pedestrian detection
CN111291612A (en) Pedestrian re-identification method and device based on multi-person multi-camera tracking
Kim et al. Real-time facial feature extraction scheme using cascaded networks
CN110826534B (en) Face key point detection method and system based on local principal component analysis
CN115690803A (en) Digital image recognition method and device, electronic equipment and readable storage medium
CN111104911A (en) Pedestrian re-identification method and device based on big data training
Du et al. Discriminative hash tracking with group sparsity
CN114168768A (en) Image retrieval method and related equipment
CN116778533A (en) Palm print full region-of-interest image extraction method, device, equipment and medium
Dutra et al. Re-identifying people based on indexing structure and manifold appearance modeling
JP6486084B2 (en) Image processing method, image processing apparatus, and program
CN114202659A (en) Fine-grained image classification method based on spatial symmetry irregular local region feature extraction
Ye et al. Person Re-Identification for Robot Person Following with Online Continual Learning
Li et al. Object tracking based on bit-planes

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220208

CF01 Termination of patent right due to non-payment of annual fee