CN111738143A

CN111738143A - Pedestrian re-identification method based on expectation maximization

Info

Publication number: CN111738143A
Application number: CN202010567949.5A
Authority: CN
Inventors: 周非; 陈文峰
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2020-06-19
Filing date: 2020-06-19
Publication date: 2020-10-02
Anticipated expiration: 2040-06-19
Also published as: CN111738143B

Abstract

The invention relates to a pedestrian re-identification method based on expectation maximization and belongs to the field of computer vision applications. First, the residual convolutional neural network ResNet50 is used as the backbone network to extract intermediate features of the input pedestrian images. An attention module is then constructed: within the module, a covariance operation inside a Non-Local operation captures the correlation information among different regions of the features, and an EM (expectation maximization) algorithm then performs sparse attention reconstruction of the features, so that feature redundancy is reduced while the latent variables in the features are mined, and the representational power of the effective feature information is enhanced. Finally, the network is trained jointly with the triplet loss function, the cross-entropy loss function and the center loss function. The invention can capture more discriminative features, substantially reduce feature redundancy, and obtain an attention feature map with low-rank properties, thereby further improving the recognition rate.

Description

Pedestrian re-identification method based on expectation maximization

Technical Field

The invention belongs to the field of computer vision, and relates to a pedestrian re-identification method based on expectation maximization.

Background

Pedestrian re-identification (Re-ID), also called cross-camera tracking, is one of the important research topics in computer vision and plays a vital role in video surveillance, intelligent security, pedestrian identity verification, human-computer interaction and similar fields. Given a query pedestrian image, pedestrian re-identification aims to retrieve images of the same identity from a large-scale pedestrian gallery captured by multiple non-overlapping cameras at different viewing angles, times and places. Compared with face recognition, the pedestrian re-identification scenario is closer to real-world conditions, but it is more easily affected by illumination changes, pedestrian pose changes, background changes, different shooting angles and the like, which poses great challenges.

At present, research on pedestrian re-identification follows two main lines: methods based on feature representation learning and methods based on metric learning. Representation-learning methods treat pedestrian re-identification as a classification problem, i.e., all images of the same pedestrian ID are classified into one class. The main task of such methods is therefore to learn more discriminative features from the images of each ID, which reduces the difficulty of classification. Metric-learning methods map high-dimensional pedestrian images to a low-dimensional feature space and measure the semantic similarity between the embedded features, so that the intra-class distance between features is reduced and the inter-class distance is increased. Traditional representation-learning methods describe features with manually designed feature descriptors; with the development of deep learning in recent years, features extracted by deep networks have proven more discriminative than hand-crafted ones. However, a neural network treats all the features it learns through its hierarchical structure equally, whereas in fact different features contribute differently to the pedestrian re-identification task. For example, the correlation between feature regions improves the representational power of the features, yet it is often ignored by ordinary convolutional networks.

Attention mechanisms allow a neural network to reallocate its computational resources toward the more important parts of a task. In pedestrian re-identification, the attention mechanism mainly focuses on capturing information that is meaningful to the task, enhancing the representational power of the features and reducing interference from useless information such as background and occlusion. The document "Hu, Jie et al, Squeeze-and-Excitation Networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017" proposes modeling the correlation between feature channels and selecting the channel features with the largest response, which provided ideas for the subsequent development of attention mechanisms. Self-attention based approaches are also increasingly used in computer vision tasks. The self-attention mechanism represents the response at one location of the image by attending to all locations of the feature map and taking their weighted average in the embedding space. For example, the document "Xiaolong Wang et al, Non-Local neural networks in CVPR, pages 7794-.

To sum up, the current problems in the field of pedestrian re-identification are: 1) the image resolution of the data sets is low and the representational power of the extracted features is insufficient, so the re-identification accuracy is low; 2) the extracted features are high-dimensional, and the classification boundary is overly complex; 3) although a network with self-attention modeling adds inter-region correlation information, it also increases the redundancy of the other features.

Disclosure of Invention

In view of the above, and addressing the insufficient representational power and high redundancy of features extracted by neural networks, the present invention provides a pedestrian re-identification method based on expectation maximization. The method introduces a Non-Local operation that uses covariance as the correlation operation to model the correlation between the regions of a feature map, and introduces an EM (expectation maximization) algorithm to perform low-rank reconstruction of the features, so as to mine as fully as possible the most discriminative information, i.e., the attention information, from the redundant features.

In order to achieve the purpose, the invention provides the following technical scheme:

A pedestrian re-identification method based on expectation maximization, the method comprising the following steps:

S1: performing different preprocessing operations on the input training and test images;

S2: constructing a ResNet50 backbone network and dividing ResNet50 into four stages, Stage1 to Stage4, which extract feature information sequentially from shallow to deep;

S3: constructing an attention module whose input and output dimensions are identical, so that it can be inserted into Stage2 and Stage3 of ResNet50; the attention module comprises two parts: a Non-Local operation that uses covariance as the correlation function, and a feature reconstruction operation performed by the EM algorithm;

S4: after the backbone network ResNet50 extracts features, splitting the network into two branches, a Global Branch and a Local Branch, wherein the Global Branch extracts the complete features of the pedestrian and the Local Branch extracts the features after a feature erasing operation;

S5: training on the training-set feature vectors extracted by the two branches using the triplet loss function, the cross-entropy loss function and the center loss function;

S6: inputting the Gallery pedestrian image set into the model trained in S5 to obtain a pedestrian feature database, in which each feature corresponds to a unique pedestrian ID;

S7: inputting a Query image into the CNN model to obtain its feature, measuring the similarity between this feature and the pedestrian features in the feature database of S6, sorting the results in descending order of similarity, and returning the number of pedestrian images specified by the user.
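The retrieval in steps S6 and S7 can be illustrated with a minimal NumPy sketch. The text does not name the similarity measure, so cosine similarity is an assumption here, and the function name `rank_gallery`, the toy features and the ID strings are all hypothetical.

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats, gallery_ids, top_k=5):
    """Return the IDs and similarities of the top_k most similar gallery entries.

    Cosine similarity is an assumed choice; the source only says
    "similarity measurement".
    """
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                        # one cosine similarity per gallery entry
    order = np.argsort(-sims)[:top_k]   # indices in descending order of similarity
    return [gallery_ids[i] for i in order], sims[order]

# Toy feature database: three gallery features with hypothetical IDs.
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
ids = ["id_a", "id_b", "id_c"]
top_ids, top_sims = rank_gallery(np.array([0.9, 0.1]), gallery, ids, top_k=2)
```

In a real system the gallery features would come from the trained model of S5 rather than from hand-written vectors.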

Optionally, the preprocessing operation in step S1 includes:

random horizontal flipping, i.e., flipping the input set of images with a given probability;

image rotation, namely rotating an input pedestrian image at a certain angle;

color enhancement, i.e., randomly altering the intensity of each channel of the input RGB image.
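Two of the preprocessing operations listed above can be sketched in NumPy as follows. This is only an illustration: the function names, the 256 × 128 image size and the `max_delta` range are assumptions, and image rotation is omitted because the text does not specify the angle range.

```python
import numpy as np

def random_horizontal_flip(img, p, rng):
    """Flip an H x W x 3 image left-right with probability p."""
    return img[:, ::-1, :] if rng.random() < p else img

def random_channel_scale(img, max_delta, rng):
    """Scale each RGB channel by an independent random factor.

    Factors are drawn from [1 - max_delta, 1 + max_delta]; this particular
    form of color enhancement is an assumption.
    """
    factors = rng.uniform(1.0 - max_delta, 1.0 + max_delta, size=3)
    return np.clip(img * factors, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((256, 128, 3))   # float image in [0, 1]; the size is an assumed example
flipped = random_horizontal_flip(img, p=1.0, rng=rng)
jittered = random_channel_scale(img, max_delta=0.2, rng=rng)
```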

Optionally, in step S2, dilated (atrous) convolution is used in Stage3 and Stage4 of the backbone network ResNet50, so as to obtain a larger feature map and retain sufficient feature information.
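Why dilation yields a larger feature map can be seen from the standard convolution output-size formula. The sketch below assumes, as is common practice but not stated in the text, that a stride-2 convolution is replaced by a stride-1 convolution with dilation 2; the helper name and the sizes are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding, dilation):
    """Spatial output size of a convolution along one dimension."""
    effective_kernel = dilation * (kernel - 1) + 1   # span covered by the dilated kernel
    return (size + 2 * padding - effective_kernel) // stride + 1

# An ordinary 3x3, stride-2 convolution halves a 16-pixel side.
strided = conv_output_size(16, kernel=3, stride=2, padding=1, dilation=1)
# A 3x3, stride-1 convolution with dilation 2 keeps the side at 16
# while each output still sees a 5-pixel span of the input.
dilated = conv_output_size(16, kernel=3, stride=1, padding=2, dilation=2)
```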

Optionally, in step S3, the construction of the attention module is divided into two stages:

Stage 1: performing the Non-Local calculation on the input features, wherein the correlation is obtained by calculating the covariance between pixels; the core Non-Local operator is:

$$y_i = \frac{1}{C(x)} \sum_{j} f(x_i, x_j)\, g(x_j)$$

where $x$ is the input feature map, the function $f(\cdot,\cdot)$ calculates the correlation between pixel $i$ and pixel $j$, the function $g(x_j)$ calculates the mapping of the feature map at pixel $j$, $C(x)$ is the normalization coefficient, and $y_i$ is the response at pixel $i$: a weighted average of the $g$-transformed features of the other pixels, with the normalized similarity function as the weights;

Stage 2: the second-order covariance statistics provide rich correlation information between regions but also bring in some highly redundant features, so the EM (expectation maximization) algorithm is adopted to sparsely reconstruct the redundant features. The EM algorithm assumes that $X = \{x_1, x_2, \dots, x_N\}$ is the obtained set of feature information, consisting of $N$ observed samples, and that each data point $x_i$ has corresponding latent information $z_i$, i.e., the most discriminative information. $\{X, Z\}$ is the complete data, with log-likelihood function $\ln p(X, Z \mid \theta)$, where $\theta$ is the set of all parameters in the model. In practice, knowledge of the latent information $Z$ comes from the posterior distribution $p(Z \mid X, \theta)$. The EM algorithm maximizes the likelihood $\ln p(X, Z \mid \theta)$ through two operations, the expectation (E) step and the maximization (M) step;

E step:

$$Q(\theta, \theta^{(i)}) = E_Z\left[\ln p(X, Z \mid \theta) \mid X, \theta^{(i)}\right] = \sum_Z \ln p(X, Z \mid \theta)\, p(Z \mid X, \theta^{(i)})$$

where $p(Z \mid X, \theta^{(i)})$ is the probability distribution of the latent variable data $Z$, i.e., the attention information, given the feature information data $X$ and the $i$-th parameter estimate $\theta^{(i)}$.

M step: the parameters are updated by maximizing the expectation obtained in the E step, giving the parameter estimate $\theta^{(i+1)}$ of the $(i+1)$-th iteration:

$$\theta^{(i+1)} = \arg\max_{\theta} Q(\theta, \theta^{(i)})$$

Optionally, in step S4, after the Global Branch extracts features, each feature map is pooled into a 2048 × 1 feature vector by a global average pooling (GAP) layer and then reduced to a 512 × 1 vector by feature dimensionality reduction; the Local Branch uses Batch DropBlock to erase the same region, in a certain proportion, from every input feature in a batch, then uses a global max pooling (GMP) layer instead of global average pooling to generate a 2048-dimensional max feature vector, which becomes 512-dimensional after dimensionality reduction.
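The two branches can be sketched with NumPy as below. This is a rough illustration under stated assumptions: the feature-map size 24 × 8, the erase ratios, and the random matrix `proj` standing in for the learned 2048-to-512 reduction layer are all hypothetical, and `batch_drop_block` is a simplified reading of Batch DropBlock in which one rectangle is zeroed at the same position for the whole batch.

```python
import numpy as np

def global_average_pool(feat):
    """feat: (B, C, H, W) -> (B, C) mean over the spatial dimensions."""
    return feat.mean(axis=(2, 3))

def global_max_pool(feat):
    """feat: (B, C, H, W) -> (B, C) max over the spatial dimensions."""
    return feat.max(axis=(2, 3))

def batch_drop_block(feat, h_ratio, w_ratio, rng):
    """Zero the same randomly placed rectangle in every feature map of the batch."""
    _, _, h, w = feat.shape
    rh, rw = int(round(h_ratio * h)), int(round(w_ratio * w))
    top = rng.integers(0, h - rh + 1)    # upper bound is exclusive
    left = rng.integers(0, w - rw + 1)
    mask = np.ones((1, 1, h, w), dtype=feat.dtype)
    mask[:, :, top:top + rh, left:left + rw] = 0
    return feat * mask                   # mask broadcasts over batch and channels

rng = np.random.default_rng(0)
feat = rng.random((4, 2048, 24, 8))               # assumed backbone output size
proj = rng.standard_normal((2048, 512)) * 0.01    # stand-in for the learned reduction layer
global_vec = global_average_pool(feat) @ proj                               # Global Branch
local_vec = global_max_pool(batch_drop_block(feat, 0.3, 1.0, rng)) @ proj   # Local Branch
```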

Optionally, in step S5, several loss functions are trained jointly. The triplet loss function minimizes the distance between a target sample and its positive sample and maximizes the distance between the target sample and its negative sample:

$$L_{triplet} = \left[d_{a,p} - d_{a,n} + m\right]_+$$

where $d_{a,p}$ is the distance between the target sample and the positive sample, $d_{a,n}$ is the distance between the target sample and the negative sample, $m$ is the threshold (margin) of the triplet loss, and $[\cdot]_+ = \max(\cdot, 0)$;

The cross-entropy describes the distance between two probability distributions; the smaller the cross-entropy, the closer the two distributions:

$$L_{id} = -\sum_{k=1}^{K} q(k) \log p(k)$$

where $k \in \{1, 2, \dots, K\}$ is the pedestrian class output by the pedestrian re-identification network, $p(k)$ is the predicted probability that the input image belongs to class $k$, and $q(k)$ is the actual probability;

The center loss function reduces the distance between samples of the same class, thereby increasing their similarity:

$$L_{center} = \frac{1}{2} \sum_{i} \left\lVert f_i - c_{y_i} \right\rVert_2^2$$

where $f_i$ is the feature of the $i$-th sample, $y_i$ is its class, and $c_{y_i}$ is the center of that class. The final loss function is a weighted sum of the three loss functions above:

$$L_{total} = L_{triplet} + \gamma_i L_{id} + \gamma_c L_{center}$$
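The joint objective can be sketched in NumPy using the textbook forms of the three losses. Averaging over the batch, the one-hot form of q(k), and the example weights `gamma_i = 1.0` and `gamma_c = 0.0005` are assumptions for illustration; the text does not give the weight values.

```python
import numpy as np

def triplet_loss(d_pos, d_neg, margin):
    """Hinge on (anchor-positive distance) - (anchor-negative distance) + margin."""
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

def cross_entropy_loss(probs, labels):
    """Cross-entropy with one-hot q(k): -log of the probability of the true class."""
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def center_loss(feats, labels, centers):
    """Half the squared distance of each feature to its class center, batch-averaged."""
    diff = feats - centers[labels]
    return 0.5 * (diff ** 2).sum(axis=1).mean()

def total_loss(l_triplet, l_id, l_center, gamma_i, gamma_c):
    """Weighted sum L_triplet + gamma_i * L_id + gamma_c * L_center."""
    return l_triplet + gamma_i * l_id + gamma_c * l_center

# Toy batch of two samples and two classes (all values hypothetical).
l_tri = triplet_loss(np.array([0.5, 1.0]), np.array([1.0, 0.8]), margin=0.3)
l_id = cross_entropy_loss(np.array([[0.5, 0.5], [0.25, 0.75]]), np.array([0, 1]))
l_cen = center_loss(np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([0, 1]),
                    np.array([[0.0, 0.0], [0.0, 1.0]]))
l_total = total_loss(l_tri, l_id, l_cen, gamma_i=1.0, gamma_c=0.0005)
```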

The beneficial effects of the invention are:

(1) The invention obtains the correlation among feature regions through the covariance operation, which brings rich second-order statistical information to the features and enhances their representational power.

(2) The invention reconstructs the features with the expectation maximization algorithm. During reconstruction, the attention information and the model parameters are updated over multiple iterations of the E and M steps; once convergence is reached, the features are reconstructed from the converged attention information and model parameters, and the reconstructed features have lower redundancy than the original features.

(3) Experimental results show that the method achieves higher re-identification accuracy than traditional spatial attention and channel attention.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.

Drawings

For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a schematic general flow chart of a pedestrian re-identification system according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of an attention module according to an embodiment of the present invention;

FIG. 3 is a visual attention diagram of the attention module of the present invention;

FIG. 4 is a comparison of the CMC curves of the algorithm of the present invention on the Market-1501 and DukeMTMC-reID data sets;

FIG. 5 is a comparison of the CMC curves of the algorithm of the present invention on the CUHK03-labeled data set;

FIG. 6 is a comparison of the CMC curves of the algorithm of the present invention on the CUHK03-detected data set.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.

The drawings are for illustrative purposes only; they are schematic representations rather than physical drawings and are not to be construed as limiting the invention. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.

The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.

Fig. 1 is a schematic flow chart of a pedestrian re-identification method based on an expectation-maximization algorithm according to an embodiment of the present invention, as shown in the figure, the method includes the following steps:

S1: different preprocessing operations are performed on the input training and test image sets.

In this step, the input training and test data come from three public pedestrian data sets: Market-1501, DukeMTMC-reID and CUHK03.

The Market-1501 data set contains 32668 detected pedestrian bounding boxes of 1501 different pedestrians, captured by 6 cameras at Tsinghua University. The training set has 751 pedestrians and 12936 images; the test set has 750 pedestrians and 19732 images. During testing, 3368 images of the 750 pedestrians are used as the query set to identify the correct pedestrian identity in the test set.

The DukeMTMC-reID data set contains 36411 pedestrian images of 1812 identities taken by 8 cameras. The training set contains 16522 images of 702 identities; the test set contains 17661 images; the query set consists of 2228 images of another 702 identities.

The CUHK03 data set contains 14097 images of 1467 identities. It provides two kinds of bounding boxes, manually labeled bounding boxes and bounding boxes detected by DPM, which form the "labeled" and "detected" subsets respectively. CUHK03 has two testing protocols; the present invention uses the new protocol, which, similarly to Market-1501, divides the data set into a training set of 767 identities and a test set of 700 identities.

TABLE 1 Summary of the experimental data sets

Data set	Year	Number of pedestrians	Number of images	Number of cameras
Market-1501	2015	1501	32668	6
DukeMTMC-reID	2017	1812	36411	8
CUHK03	2014	1467	13164	10

The present invention uses two evaluation criteria to evaluate the performance of the model on all data sets. The first is the Cumulative Matching Characteristics (CMC) curve, which gives the probability of finding the correct match among the first k matching results. CMC treats pedestrian re-identification as a ranking problem and is reported as Rank-k: if the Rank-k recognition rate is P, then the probability that the correct target appears among the top k ranked results is P. The second is the mean Average Precision (mAP), which treats pedestrian re-identification as a retrieval problem and evaluates the overall performance of the model.
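For a single query, both criteria reduce to simple computations on the ranked list of gallery IDs, as in the simplified sketch below. The function names and the toy ranking are hypothetical, and the camera-based filtering used by the standard benchmark protocols is left out.

```python
import numpy as np

def rank_k_hit(ranked_ids, query_id, k):
    """1 if the correct identity appears among the top k results, else 0."""
    return int(query_id in ranked_ids[:k])

def average_precision(ranked_ids, query_id):
    """Mean of the precision values at each position where a correct match occurs."""
    matches = np.array([g == query_id for g in ranked_ids], dtype=float)
    if matches.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(matches) / (np.arange(len(matches)) + 1)
    return float((precision_at_i * matches).sum() / matches.sum())

ranked = [7, 3, 3, 9, 3]                   # gallery IDs sorted by descending similarity
hit_at_1 = rank_k_hit(ranked, query_id=3, k=1)
hit_at_2 = rank_k_hit(ranked, query_id=3, k=2)
ap = average_precision(ranked, query_id=3)
```

Averaging `rank_k_hit` over all queries gives one point of the CMC curve, and averaging `average_precision` over all queries gives mAP.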

S3: constructing an attention module whose input and output dimensions are identical, and inserting it into Stage2 and Stage3 of ResNet50.

Specifically, the attention module comprises two stages, F and B: stage F is the Non-Local operation that uses covariance as the correlation function, and stage B is the feature reconstruction operation performed by the EM algorithm. Fig. 2 is a schematic structural diagram of the attention module according to an embodiment of the present invention. Since the various parts of a pedestrian's body are correlated with one another, stage F introduces second-order covariance statistics to capture the correlation between non-local regions in the feature map space.

Given an input feature map $X \in R^{h \times w \times c}$, where $h$ and $w$ are the height and width of the feature map and $c$ is its number of channels, the spatial dimensions of $X$ are flattened into one dimension so that $X \in R^{hw \times c}$. Two functions $\theta(x)$ and $g(x)$ are then constructed, each from a 1 × 1 convolution, a batch normalization layer and a ReLU activation function, yielding two feature maps of dimension

$$\theta(x),\ g(x) \in R^{hw \times c/r}$$

where $r$ is the feature channel reduction factor. The covariance matrix is then calculated from $\theta(x)$:

$$\Sigma = \theta(x)\, \bar{I}\, \theta(x)^{T}$$

where

$$\bar{I} = \frac{1}{c/r}\left(I - \frac{1}{c/r}\mathbf{1}\right)$$

$I$ is the identity matrix, $\mathbf{1}$ is the all-ones matrix, and $\Sigma$ has dimension $hw \times hw$. The covariance matrix is multiplied by a scaling factor $s$, passed through a softmax function, and matrix-multiplied with $g(x)$ to obtain $X'$:

$$X' = \mathrm{softmax}(s\,\Sigma)\, g(x)$$

Because stage F introduces second-order statistics, it also brings in a large amount of redundant feature information, which has a negative effect on the pedestrian re-identification task. For this reason, an expectation maximization (EM) algorithm is introduced to reconstruct the features output by stage F from a small number of feature descriptors, so that the reconstructed features have low-rank properties.
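The covariance-based Non-Local step of stage F can be sketched in NumPy as follows. The embeddings are random stand-ins for the outputs of the learned 1 × 1 convolution branches, and the centering matrix and the scaling value `1 / sqrt(d)` are assumptions (a standard covariance form and a common attention scaling), since the text does not give them explicitly.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def covariance_non_local(theta, g, scale):
    """theta, g: (hw, d) embeddings with d = c / r. Returns X' of shape (hw, d)."""
    d = theta.shape[1]
    centering = (np.eye(d) - np.ones((d, d)) / d) / d   # assumed centering matrix
    cov = theta @ centering @ theta.T                   # (hw, hw) covariance between pixels
    attention = softmax(scale * cov, axis=1)            # each row sums to 1
    return attention @ g

rng = np.random.default_rng(0)
hw, d = 12, 8                                   # toy sizes
theta = rng.standard_normal((hw, d))            # stand-in for theta(x)
g = rng.standard_normal((hw, d))                # stand-in for g(x)
x_prime = covariance_non_local(theta, g, scale=1.0 / np.sqrt(d))   # assumed scale
```

When every pixel has a constant feature vector, the covariance vanishes and the attention becomes uniform, so each output row is simply the mean of the rows of `g`.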

Stage B consists of three steps: the expectation (E) operation, the maximization (M) operation and the feature reconstruction operation. The EM algorithm is used to find the maximum-likelihood solution of a model containing hidden variables; here the hidden variable is the mapping matrix $Z$ and the model parameters are the $K$ descriptors. The feature map input to stage B is $X' \in R^{hw \times c/r}$, and the initial value of the descriptors is $u \in R^{K \times c/r}$. The E step updates the mapping matrix $Z \in R^{hw \times K}$ (the attention map):

$$Z = \mathrm{softmax}\left(\lambda X' u^{T}\right)$$

where $\lambda$ is a hyper-parameter that controls the distribution of $Z$; its default value is 1.

The M step updates the descriptors $u$ (the parameters), where $u$ is calculated as the weighted average of $X'$; the $k$-th descriptor is updated as:

$$u_k = \frac{\sum_{n=1}^{hw} z_{nk}\, x'_n}{\sum_{n=1}^{hw} z_{nk}}$$

where $z_{nk}$ is the entry of $Z$ for pixel $n$ and descriptor $k$, and $x'_n$ is the $n$-th row of $X'$.

The E and M steps are executed alternately for $T$ iterations, until $u$ and $Z$ approximately converge. $u$ and $Z$ are then used to re-estimate $X'$, giving $X''$:

$$X'' = Z u$$

Finally, the number of channels of the reconstructed $X'' \in R^{hw \times c/r}$ is restored by a 1 × 1 convolution, and the result is added to the original feature map $X$ to obtain $X'''$:

$$X''' = X + X''$$
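The E step, M step and reconstruction of stage B can be sketched in NumPy as below. The random initialisation of the descriptors, the small `eps` added to the denominator, and the toy sizes are assumptions; the final 1 × 1 convolution and residual addition are left out because they depend on learned weights.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def em_reconstruct(x, u, num_iters=3, lam=1.0, eps=1e-6):
    """x: (hw, d) stage-F output X'; u: (K, d) initial descriptors.

    Returns the reconstruction X'' = Z u, the attention map Z and the final u.
    """
    for _ in range(num_iters):
        z = softmax(lam * (x @ u.T), axis=1)             # E step: Z has shape (hw, K)
        u = (z.T @ x) / (z.sum(axis=0)[:, None] + eps)   # M step: weighted mean of rows of X'
    return z @ u, z, u

rng = np.random.default_rng(0)
hw, d, k = 20, 8, 3                      # toy sizes, far smaller than a real feature map
x_prime = rng.standard_normal((hw, d))
u0 = rng.standard_normal((k, d))         # random initialisation is an assumption
x_rec, z, u = em_reconstruct(x_prime, u0, num_iters=3, lam=1.0)
```

Because `x_rec` is the product of an (hw, K) matrix and a (K, d) matrix, its rank is at most K, which is the low-rank property the reconstruction is meant to provide.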

The pseudocode of the attention module is as follows:

TABLE 2 Attention module algorithm framework

S4: after the backbone network ResNet50 extracts features, the network is split into two branches, a Global Branch and a Local Branch, wherein the Global Branch extracts the complete features of the pedestrian and the Local Branch extracts the features after a feature erasing operation.

S5: the training-set feature vectors extracted by the two branches are trained using the triplet loss function, the cross-entropy loss function and the center loss function.

Fig. 3 shows the attention feature map generated after the EM algorithm in the attention module of the present invention has iterated to convergence. The figure shows that the EM iterations guide the model's attention to the pedestrian while ignoring, to some extent, the interference from background information. In addition, the feature descriptors u generated during iteration are mutually orthogonal, which reduces feature redundancy. Experiments verify that the highest accuracy is obtained when the number of feature descriptors K is 160 and the number of iterations T is 3.

The network performance of the invention was verified experimentally with no attention module, with stage F alone, with stage B alone, and with the complete attention module; the results are shown in Tables 3 and 4.

TABLE 3 Comparison of attention module ablation experiments on the DukeMTMC-reID and Market-1501 data sets

TABLE 4 Comparison of attention module ablation experiments on the CUHK03 data set

The two tables show that both stage F and stage B bring a gain in network accuracy. On the DukeMTMC-reID data set, adding stage F improves the mean average precision (mAP) and the first-hit rate (rank1) by 1.1% and 1.0% respectively, which shows that the features extracted after introducing covariance are more expressive than the original features. Introducing stage B alone, so that the features are processed by the EM algorithm, improves mAP and rank1 by 1.7% and 1.5% over the original network, which shows that features reconstructed by the EM algorithm are effective for model optimization. When stages F and B are fused, mAP and rank1 improve further compared with using either stage alone, reaching 78.8% and 89.4% respectively. As on DukeMTMC-reID, the attention module also improves the accuracy of the underlying network on Market-1501 and CUHK03.

FIGS. 4, 5 and 6 compare the recognition rates of the proposed attention module on the three data sets DukeMTMC-reID, Market-1501 and CUHK03. The figures show that stage F and stage B each improve the recognition rate to a different degree, and that the complete attention module, which fuses the two stages, gives the largest improvement.

The above embodiments are provided only to illustrate the present invention and not to limit it, and those skilled in the art may make various changes or modifications without departing from the spirit and scope of the present invention.

Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims

1. A pedestrian re-identification method based on expectation maximization, characterized in that the method comprises the following steps:

2. The expectation-maximization-based pedestrian re-identification method according to claim 1, wherein: in the step S1, the preprocessing operation includes:

image rotation, namely rotating an input pedestrian image at a certain angle;

3. The expectation-maximization-based pedestrian re-identification method according to claim 2, wherein: in step S2, dilated (atrous) convolution is used in Stage3 and Stage4 of the backbone network ResNet50, so as to obtain a larger feature map and retain sufficient feature information.

4. The expectation-maximization-based pedestrian re-identification method according to claim 3, wherein: in step S3, the construction of the attention module is divided into two stages:

Stage 2: the second-order covariance statistics provide rich correlation information between regions but also bring in some highly redundant features, so the EM (expectation maximization) algorithm is adopted to sparsely reconstruct the redundant features; the EM algorithm assumes that $X = \{x_1, x_2, \dots, x_N\}$ is the obtained set of feature information, consisting of $N$ observed samples, and that each data point $x_i$ has corresponding latent information $z_i$, i.e., the most discriminative information; $\{X, Z\}$ is the complete data, with log-likelihood function $\ln p(X, Z \mid \theta)$, where $\theta$ is the set of all parameters in the model; in practice, knowledge of the latent information $Z$ comes from the posterior distribution $p(Z \mid X, \theta)$; the EM algorithm maximizes the likelihood $\ln p(X, Z \mid \theta)$ through two operations, the expectation step E and the maximization step M;

E step:

$$Q(\theta, \theta^{(i)}) = E_Z\left[\ln p(X, Z \mid \theta) \mid X, \theta^{(i)}\right] = \sum_Z \ln p(X, Z \mid \theta)\, p(Z \mid X, \theta^{(i)})$$

M step:

$$\theta^{(i+1)} = \arg\max_{\theta} Q(\theta, \theta^{(i)})$$

5. The expectation-maximization-based pedestrian re-identification method according to claim 4, wherein: in step S4, after the Global Branch extracts features, each feature map is pooled into a 2048 × 1 feature vector by a global average pooling (GAP) layer and then reduced to a 512 × 1 vector by feature dimensionality reduction; the Local Branch uses Batch DropBlock to erase the same region, in a certain proportion, from every input feature in a batch, then uses a global max pooling (GMP) layer instead of global average pooling to generate a 2048-dimensional max feature vector, which becomes 512-dimensional after dimensionality reduction.

6. The expectation-maximization-based pedestrian re-identification method according to claim 5, wherein: in step S5, several loss functions are trained jointly; the triplet loss function minimizes the distance between a target sample and its positive sample and maximizes the distance between the target sample and its negative sample:

$$L_{triplet} = \left[d_{a,p} - d_{a,n} + m\right]_+$$

where $d_{a,p}$ is the distance between the target sample and the positive sample, $d_{a,n}$ is the distance between the target sample and the negative sample, and $m$ is the threshold (margin) of the triplet loss;

L_total＝L_triplet+γ_iL_id+γ_cL_center。