CN113378729A - Pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method - Google Patents
- Publication number
- CN113378729A (application CN202110667913.9A)
- Authority
- CN
- China
- Prior art keywords
- image
- pedestrian
- pose
- embedding
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method, which comprises the following steps: preprocessing an original pedestrian image by random erasing to obtain a pedestrian image, optimizing the baseline network of a Resnet-50 network model, and extracting deep convolution features; extracting a salient human body image from the original pedestrian image; extracting the pose from the salient human body image, then extracting local semantic features from the resulting body part image; performing weighted fusion of the deep convolution features and the local semantic features, and performing distance measurement on the weighted fusion features to generate an initial measurement list; and reordering the images in the initial measurement list with a re-ranking algorithm to obtain the correct image match ranking, and outputting matching pedestrian images to identify a specific pedestrian. The accuracy of identification and localization can be greatly improved.
Description
Technical Field
The invention belongs to the technical field of image processing methods, and relates to a pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method.
Background
In recent years, artificial intelligence has become a focal point of scientific and technological development, and its use in the intelligent monitoring domain has become correspondingly important. As cities expand, monitoring systems spread further; a single city may have thousands of cameras covering its streets from end to end. With cameras multiplying, relying on human operators alone is extremely expensive, and no operator can watch so many feeds at once. Pedestrian re-identification technology has therefore attracted the attention of researchers: it helps people monitor, track, and identify pedestrians. Human beings receive and perceive external information mainly through vision, and human vision can extract the required information directly from cluttered images; researchers therefore want cameras that, mimicking the human visual system, capture objects in an environment effectively and quickly. This line of work ultimately led to today's pedestrian re-identification technology. The technology is widely used. Intelligent monitoring systems, for example, rely on it: exploiting the computer's powerful data-processing capability, a video monitoring system can automatically filter out useless information and actively identify human bodies, enabling comprehensive monitoring and a 24-hour monitoring system with early warning and after-the-fact evidence collection. Pedestrian traffic statistics use the same technique: useless information is filtered out automatically, pedestrians are identified and counted automatically, and pedestrians who appear repeatedly in different areas are not double-counted, so pedestrian flow can be counted effectively and accurately.
Pedestrian re-identification accuracy is strongly affected by one key factor: pedestrian misalignment. The mutual occlusion of body parts and the continual pose changes that misalignment brings are a major challenge for pedestrian re-identification research. First, a pedestrian's posture changes constantly during movement, which means that local changes of the body within the bounding box are unpredictable. For example, a pedestrian may place a hand behind the back or on top of the head while moving, causing local occlusion due to misalignment, which strongly affects the extracted features. Second, detection when pedestrians are irregularly arranged affects the accuracy of pedestrian re-identification. One method commonly used in pedestrian re-identification is to divide the bounding box into horizontal stripes; however, this method only holds up under slight vertical deviation. Under severe vertical misalignment, stripes meant to cover the body and head may instead be matched against background, causing erroneous matches in the re-identification task, so the horizontal striping approach is not ideal in cases of severe misalignment. Moreover, when a pedestrian's posture changes, the background changes too, so the background may be wrongly weighted by the convolutional neural network and degrade recognition accuracy. How to overcome the misalignment and background changes caused by pedestrian pose variation is therefore the key to improving pedestrian re-identification accuracy.
Disclosure of Invention
The invention aims to provide a pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method, which solves the prior-art problem of low pedestrian re-identification precision caused by the misalignment and background changes that result from pedestrian pose variation.
The invention adopts the technical scheme that a pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method comprises the following steps:
step 1, preprocessing an original pedestrian image by random erasing to obtain a pedestrian image, performing baseline network optimization on a Resnet-50 network model, and inputting the pedestrian image into the optimized Resnet-50 network model to obtain deep convolution features;
step 2, performing feature extraction with the original pedestrian image as the input image to obtain a salient human body image;
step 3, first adopting a pose convolver (convolutional pose machine) to extract the pose from the salient human body image to obtain a body part image, then inputting the body part image into a ResNet-50 network to extract local semantic features;
step 4, performing weighted fusion of the deep convolution features and the local semantic features to obtain weighted fusion features, measuring the distances between the fusion features of the images in the image test library and the image query library, and generating an initial measurement list from the distance measurement results;
and step 5, reordering the images in the initial measurement list with a re-ranking algorithm to obtain the correct image match ranking, and outputting matching pedestrian images to identify a specific pedestrian.
The invention is also characterized in that:
the specific mode for carrying out the baseline network optimization on the Resnet-50 network model is as follows:
and optimizing a loss function of the Resnet-50 network model by combining Softmax loss and triple loss, wherein the optimized loss function is as follows:
in the above formula, m is the number of loss functions;
in the above formula, the first and second carbon atoms are,is the feature vector of the anchor point sample,is the feature vector of the positive sample,is a feature vector of negative samples, alpha isA distance betweenThe distance between them is the smallest distance, + represents [, ]]When the value of the internal is more than zero, the value is a loss value, and when the value is less than zero, the loss is zero.
The step 2 specifically comprises the following steps:
step 2.1, removing the last pooling stage of the VGG-16 network structure to obtain the network structure, inputting the original pedestrian image into the network structure as the input image, and outputting a feature map;
step 2.2, deconvolving the feature map to the size of the input image and adding a new convolution layer to generate a predicted saliency map;
step 2.3, first applying a convolution layer with kernel size 1 × 1 to the conv1-2 layer of the network structure to generate a boundary prediction, then adding the boundary prediction to the predicted saliency map to obtain a refined boundary, and then applying one more convolution layer to convolve the refined result to obtain the salient human body image.
The step 3 specifically comprises the following steps:
step 3.1, taking the salient human body image as the input of a pose estimator and locating 14 joint points;
step 3.2, grouping the 14 located human body joints into 6 sub-regions, cropping, rotating, and resizing the 6 sub-regions to a fixed size and orientation, and combining them into a stitched body part image;
step 3.3, performing a pose transformation on the size of each body part in the stitched body part image to obtain the body part image;
step 3.4, inputting the body part image into a ResNet-50 network for training and extracting local semantic features.
The specific process of the step 5 is as follows:
for a pedestrian test image p and an image set G = {g_i}, the k-reciprocal nearest neighbors are encoded by weighting into a single vector to form the k-reciprocal feature; the Jaccard distances between the pedestrian test image p and the image set are then computed from the images' k-reciprocal features; finally, the original distances between the pedestrian test image p and the image set are weighted together with the Jaccard distances to obtain the distance formula. The distances between the images in the initial measurement list and the fusion features are computed with this distance formula, the list is reordered to obtain the correct image match ranking, and matching pedestrian images are output to identify the specific pedestrian.
The invention has the following beneficial effects:
The pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method of the invention fuses deep global features with local semantic features, measures the distances between different images using the fused weighted features, and identifies and retrieves images of the same pedestrian; applied to an original image database, it retrieves the images of a specific pedestrian, making it well suited to a pose embedding-based multi-scale convolution feature fusion pedestrian re-identification system. The performance of the baseline network is improved by random erasing and the triplet loss function, and the local features obtained by pose estimation are weighted and aggregated with the global features obtained by the baseline network, achieving global optimization; this aids target identification and localization, speeds up the algorithm, and improves the stability of the system. The method can greatly improve the accuracy of identification and localization, and can be used for target identification and retrieval of pedestrian images as well as in other fields.
Drawings
FIG. 1 is a flow chart of a pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method;
FIG. 2 is a diagram of the effect of random erasure processing of a pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method of the invention;
FIG. 3 is a triple loss schematic diagram of a pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method of the invention;
FIG. 4 is a pose embedding effect diagram of a pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
A pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method is shown in figure 1 and comprises the following steps:
step 1, establishing an image database; in this embodiment the image database consists of pedestrian images collected manually and corrected by computer, 72000 images in total. The original pedestrian image is preprocessed by random erasing to obtain the pedestrian image, baseline network optimization is performed on the Resnet-50 network model, and the pedestrian image is input into the optimized Resnet-50 network model to obtain deep convolution features;
step 1.1, randomly erasing an original pedestrian image by adopting a random erasing enhancement processing method to obtain a pedestrian image;
specifically, Random erase enhancement (REA) is an effective data enhancement method. The method aims to shield different training images, randomly generate a rectangular area in the images, randomly generate the position and the size of the rectangular area, shield partial pedestrian images, and set the pixel value of the image shielding area as a random value. By the method, the occurrence of over-fitting can be reduced, and the convergence capability of the network model is improved, so that the performance of the deep learning model is improved.
In the network model training, for an original training data set, assuming that the probability of random erasure of the original data set is P, the probability of non-erasure is 1-P. In the random erasing process, a rectangular area is generated with a set probability P to shield the image, and the position and the size of the shielded area which are randomly erased and shielded in the process are random.
Assume the size of the image to be randomly erased, i.e. the original pedestrian image, is:
S = W × H (1);
in the above formula, W is the width of the pedestrian image and H is the height of the pedestrian image.
Assume the area of the rectangular region to be randomly erased is S_e, with the ratio S_e/S constrained between a minimum value S_l and a maximum value S_h, and let the aspect ratio of the erased region be r_e. The height H_e and width W_e of the erased rectangle are then:
H_e = √(S_e × r_e) (2);
W_e = √(S_e / r_e) (3);
in the above formulas, S_e is the area of the erased rectangle, r_e is the aspect ratio of the erased rectangle, H_e is the height of the erased rectangle, and W_e is the width of the erased rectangle.
A point P = (x_e, y_e) is randomly selected on the original pedestrian image. If it satisfies formulas (4) and (5):
x_e + W_e ≤ W (4);
y_e + H_e ≤ H (5);
then the rectangular region of the original pedestrian image to be erased is (x_e, y_e, x_e + W_e, y_e + H_e). The region to be erased is chosen by random erasing, and each pixel in the rectangular region is assigned a random value in [0, 255] to replace the original rectangular region. If the randomly selected point P = (x_e, y_e) does not satisfy the conditions of formulas (4) and (5), the above process is repeated and a new point P = (x_e, y_e) is selected on the image until a suitable random point is found. Finally, the randomly erased original pedestrian image (i.e. the pedestrian image) is output, as shown in fig. 2.
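To make the procedure concrete, the following is a minimal Python sketch of this random erasing step following formulas (1)–(5); the default probability and the area/aspect-ratio ranges are illustrative assumptions, not values fixed by the patent:

```python
import math
import random
import numpy as np

def random_erase(img, p=0.5, s_l=0.02, s_h=0.4, r1=0.3, r2=3.33):
    """Randomly erase a rectangle in `img` (H x W x C uint8 array).

    With probability 1 - p the image is returned unchanged. Otherwise a
    rectangle of area S_e in [s_l*S, s_h*S] and aspect ratio r_e in
    [r1, r2] is filled with random values in [0, 255].
    """
    if random.random() > p:
        return img
    h, w = img.shape[:2]
    area = h * w                                # S = W x H, formula (1)
    for _ in range(100):                        # retry until the box fits
        s_e = random.uniform(s_l, s_h) * area   # erased area S_e
        r_e = random.uniform(r1, r2)            # aspect ratio r_e
        h_e = int(round(math.sqrt(s_e * r_e)))  # H_e, formula (2)
        w_e = int(round(math.sqrt(s_e / r_e)))  # W_e, formula (3)
        x_e = random.randint(0, w)              # candidate point P = (x_e, y_e)
        y_e = random.randint(0, h)
        if x_e + w_e <= w and y_e + h_e <= h:   # constraints (4) and (5)
            img = img.copy()
            img[y_e:y_e + h_e, x_e:x_e + w_e] = np.random.randint(
                0, 256, (h_e, w_e) + img.shape[2:], dtype=img.dtype)
            return img
    return img                                  # no valid box found; unchanged
```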
Step 1.2, optimizing the loss function of the Resnet-50 network model by combining the Softmax loss and the triplet loss;
Specifically, in the field of pedestrian re-identification the triplet loss (Triplet loss) is widely applied, often together with the Softmax loss in one network model. As shown in fig. 3, when the triplet loss function is used, three images are taken as the input of the network: (x^a, x^p, x^n), where x^a is the anchor sample (Anchor), randomly selected from the data set used to train the network model; x^p is a training sample whose pedestrian identity belongs to the same class as the anchor sample, i.e. the positive sample; and x^n is a training sample whose pedestrian identity is not of the same class as the anchor sample, i.e. the negative sample. These training samples are input into identical network structures for feature extraction, as shown in fig. 3; after learning through the triplet loss, the distance between the anchor sample and the positive sample becomes the smallest and the distance between the anchor sample and the negative sample becomes the largest. The final formula for calculating the triplet loss is:
L_Triplet = [ ‖f(x^a) − f(x^p)‖² − ‖f(x^a) − f(x^n)‖² + α ]_+ (6);
in the above formula, f(x^a) is the feature vector of the anchor sample, f(x^p) is the feature vector of the positive sample, f(x^n) is the feature vector of the negative sample, α is the margin by which the distance between f(x^a) and f(x^n) must exceed the distance between f(x^a) and f(x^p), and [·]_+ means that when the bracketed value is greater than zero it is taken as the loss value, and when it is less than zero the loss is zero.
As can be seen from the objective function: when the distance between f(x^a) and f(x^p) plus α is greater than the distance between f(x^a) and f(x^n), the bracketed value is greater than zero and there is a loss value; when the distance between f(x^a) and f(x^n) is greater than or equal to the distance between f(x^a) and f(x^p) plus α, the loss value is zero.
Through the triplet loss function the network model shortens the distance between pedestrian images with the same label and enlarges the distance between pedestrian images with different labels, so that the trained network model is more discriminative.
The loss function of the Resnet-50 network model is optimized by combining the Softmax loss and the triplet loss; the optimized loss function is:
L = Σ_{j=1}^{m} L_j (7);
in the above formula, m is the number of loss functions combined (here the Softmax loss and the triplet loss).
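As an illustration of the combined objective, here is a minimal PyTorch sketch of a Softmax (cross-entropy) loss summed with a triplet loss of margin α, as in formulas (6) and (7); the batch-hard mining strategy and the margin value are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def combined_loss(features, logits, labels, margin=0.3):
    """Softmax (cross-entropy) loss plus triplet loss with margin `margin`.

    features: (B, D) embedding vectors f(x); logits: (B, C) class scores;
    labels: (B,) pedestrian identity labels. Batch-hard mining: for each
    anchor, the farthest positive and the closest negative are used.
    """
    softmax_loss = F.cross_entropy(logits, labels)

    dist = torch.cdist(features, features)            # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_mask = same & ~eye                            # positives, excluding self
    neg_mask = ~same

    d_ap = (dist * pos_mask).max(dim=1).values        # hardest positive
    d_an = dist.masked_fill(~neg_mask, float('inf')).min(dim=1).values
    triplet_loss = F.relu(d_ap - d_an + margin).mean()  # the [.]_+ hinge of (6)

    return softmax_loss + triplet_loss                # sum of the m losses, (7)
```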
and step 1.3, inputting the pedestrian image into the optimized Resnet-50 network model to obtain the deep convolution characteristic.
Step 2, taking the original pedestrian image as the input image for feature extraction and separating the foreground from the background to obtain a salient human body image;
step 2.1, removing the last pooling stage of the VGG-16 network structure to obtain the network structure, inputting the original pedestrian image into the network structure as the input image, and outputting a feature map;
specifically, the VGG-16 model has ideal effects in the aspects of image classification and generalization special effects, so the significance model also uses the VGG-16 to construct a network structure. Given an input image of size WXH, the output map has a size [ W/2 ]5,H/25]So a network structure built based on VGG-16 reduces the output by a factor of 32 of feature mapping. In this embodiment, the last pooling stage of VGG-16 is eliminated, so that the size of the input image can be enlarged, and the semantic context and image details can be balanced. Therefore, the feature map output by the network structure of the present invention is reduced by 16 times compared to the input image.
Step 2.2, the integrated feature map already contains various saliency cues, so it can be used to predict the saliency map. Specifically, the feature map is deconvolved to the size of the input image and a new convolution layer is added to generate the predicted saliency map;
and 2.3, adding boundary refinement by introducing short connection into the prediction result, further performing boundary refinement to separate the foreground from the background, and expecting that the bottom layer features are helpful for predicting the boundary of the object. Furthermore, these features also have the same spatial resolution for the input image. Specifically, a convolutional layer with the core size of 1 × 1 in the network structure is applied to a conv1-2 layer to generate boundary prediction, the boundary prediction is added to a prediction significance map to obtain a refined boundary frame, and then the refined boundary frame is convolved by applying one convolutional layer to obtain a significant human body image.
And step 3, first a pose convolver is adopted to extract the pose from the salient human body image to obtain the body part image, and the body part image is then input into a ResNet-50 network to extract local semantic features. Specifically, pose extraction is performed with a ready-made pose convolver (convolutional pose machine) model, a sequential convolution structure that can detect 14 body joints, i.e. the head, neck, left and right shoulders, left and right elbows, left and right wrists, left and right hips, left and right knees, and left and right ankles, as shown in fig. 4.
Step 3.1, taking the salient human body image as the input of the pose estimator and locating 14 joint points; the 14 joints are the head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, left hip, left knee, left ankle, right hip, right knee, and right ankle;
step 3.2, grouping the 14 located human body joints into 6 sub-regions (head, upper body, left arm, right arm, left leg, and right leg) as human body parts, cropping, rotating, and resizing the 6 sub-regions to a fixed size and orientation, and combining them into a stitched body part image; because the 6 parts of a human body differ in size, black regions inevitably appear in the stitched image;
step 3.3, performing a pose transformation on the size of each body part in the stitched body part image to obtain the body part image;
since black regions appear in the stitched body part image, a pose transformation must be applied to the size of each body part to remove them; the size of each body part is determined mainly by observation. For example, in this embodiment the arm width is observed to be about 20 pixels and the leg width about 30 pixels; decreasing these parameter values causes information loss, while increasing them may introduce more background noise. As long as the parameter variation is small, however, system performance remains stable: when a part's size varies within a small range, the discriminative information it contains does not change much, so the network can still learn a discriminative embedding given the supervision signal.
And step 3.4, dividing the body part images into a test set and a training set, inputting them into a ResNet-50 network for training, and extracting local semantic features. The ResNet-50 network in this step does not share weights with the optimized ResNet-50 network of step 1; instead, a separate set of weights is trained to discriminate the local semantic images and extract the local semantic features.
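For illustration, a simplified sketch of assembling the 6 part crops from the 14 estimated joints follows; the joint grouping matches the text, while the crop padding and the fixed output size are illustrative assumptions:

```python
import numpy as np
import cv2  # OpenCV, assumed available for cropping and resizing

# 14 joints in the order listed in step 3.1
JOINTS = ["head", "neck", "r_shoulder", "r_elbow", "r_wrist",
          "l_shoulder", "l_elbow", "l_wrist", "l_hip", "l_knee",
          "l_ankle", "r_hip", "r_knee", "r_ankle"]

# 6 sub-regions of step 3.2 as groups of joint indices
PARTS = {
    "head":       [0, 1],
    "upper_body": [1, 2, 5, 8, 11],
    "l_arm":      [5, 6, 7],
    "r_arm":      [2, 3, 4],
    "l_leg":      [8, 9, 10],
    "r_leg":      [11, 12, 13],
}

def crop_part(img, joints_xy, idxs, out_size=(64, 64), pad=10):
    """Crop the padded bounding box of the listed joints and resize it to
    a fixed size and orientation, as in steps 3.2-3.3."""
    pts = joints_xy[idxs]                       # (n, 2) array of (x, y)
    x0, y0 = np.maximum(pts.min(axis=0).astype(int) - pad, 0)
    x1 = min(int(pts[:, 0].max()) + pad, img.shape[1])
    y1 = min(int(pts[:, 1].max()) + pad, img.shape[0])
    return cv2.resize(img[y0:y1, x0:x1], out_size)

def stitch_parts(img, joints_xy):
    """Return the 6 normalized part crops, ready to be tiled into the
    stitched body part image fed to the ResNet-50 of step 3.4."""
    return {name: crop_part(img, joints_xy, idxs)
            for name, idxs in PARTS.items()}
```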
Step 4, performing weighted fusion of the deep convolution features and the local semantic features to obtain the weighted fusion features, measuring the distances between the fusion features of the images in the image test library and the image query library, generating an initial measurement list ranking from the distance measurement results, and returning a query score. The feature weighted aggregation takes the form:
d = α f_DEEP + (1 − α) f_SOD (8);
in the above formula, the parameter α, with 0 ≤ α ≤ 1, assigns different weights to the deep global feature and the local semantic feature.
And 5, reordering the images in the initial measurement list according to a reordering algorithm to obtain a correct image matching ranking, and outputting a pedestrian matching image to identify a specific pedestrian.
Specifically, for a pedestrian test image p and an image set G = {g_i | i = 1, 2, ..., N}, the k-reciprocal nearest neighbors are encoded by weighting into a single vector to form the k-reciprocal feature; the Jaccard distances between the pedestrian test image p and the image set are then computed from the images' k-reciprocal features; finally, the original distances between the pedestrian test image p and the image set are weighted together with the Jaccard distances to obtain the final distance. The distances between the images in the initial measurement list and the fusion features are computed, the list is sorted to obtain the correct image match ranking, and matching pedestrian images are output to identify the specific pedestrian.
Step 5.1, first a pedestrian image p is given for testing, and an image set G = {g_i | i = 1, 2, ..., N} is given as the pedestrian image gallery; the original distance between the pedestrian image p and a gallery image g_i is measured by the Mahalanobis distance:
d(p, g_i) = (x_p − x_{g_i})ᵀ M (x_p − x_{g_i}) (9);
in the above formula, x_p is the appearance feature of the test image p, x_{g_i} is the appearance feature of the gallery image g_i, and M is a positive semi-definite matrix.
From the original distances between the test image p and the gallery images g_i, the initial ranked list is obtained:
L(p, G) = {g⁰_1, g⁰_2, ..., g⁰_N}, where d(p, g⁰_i) < d(p, g⁰_{i+1}) (10).
and 5.2, the purpose of the reordering strategy is to reorder the L (p, G) initial list ranking, so that more correctly matched image samples are arranged at the first position of the list, and the identification precision of pedestrian re-identification is improved.
The top k ranked samples in the initial ranking list, i.e., k neighbors (k-nearest neighbors, k-nn):
the k-reciprocal nearest neighbors (k-reciprocal nearest neighbors, k-rnn) are expressed as:
R(p,k)=gi|(gi∈N(p,k))∧p∈N(gi,k) (12);
however, due to a series of influencing factors such as brightness variation, posture variation, view angle variation and occlusion, the correctly matched samples may be excluded from the nearest neighbors. To solve this problem, each candidate nearest neighbor set is converted into a more robust set:
for each test image sample in the original set R (p, k), find their k-reciprocal nearest neighbor setWhen the number of the overlapped samples reaches a certain condition, the overlapped samples and the R (p, k) are merged, and more positive samples can be added into the R (p, k) set after expansion;
step 5.3, according to the original distance between the retrieval image and the near neighbor, the weight is redistributed, and the k-inverted nearest neighbor set of the sample image is encoded into an N-dimensional vector through a Gaussian kernel, which is defined as Expressed as:
based on neighbors being assigned greater weights and distant neighbors being assigned lesser weights, the candidates for intersection and union needed to compute the Jacobian distances may be computed as:
the intersection sets take the minimum value in the corresponding dimensionality of the two feature vectors as the degree that the two feature vectors contain gi together through minimum operation, and the maximum operation of the union set is to count the total set of matching candidates in the two sets;
step 5.4, the final Jacobian distance is expressed as:
and correcting the initial sorted list by combining the original distance and the Jacobi distance, wherein the final distance is defined as:
d*(p,gi)=(1-λ)dJ(p,gi)+λd(p,gi) (18);
in the above formula, λ is a weighting parameter λ representing the weight of two distances, and when λ is 0, only the jacobian distance is considered, and when λ is 1, only the original distance is considered, where λ is set to 0.3;
and 5.5, calculating the distance between the image and the fusion feature in the initial measurement list by using a formula (18), sequencing to obtain a correct image matching ranking, outputting a pedestrian matching image to identify a specific pedestrian, and finishing identification.
In this way, the pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method mainly aims at retrieving the corresponding pedestrian pictures from a large pedestrian image database and, given a pair of images, finding the pictures of the same pedestrian in the database. The influence of complex backgrounds is filtered out by separating foreground from background; the local features of pedestrians are extracted using a human key-point estimation method; and the robustness of the network model is strengthened by preprocessing the baseline network's input images with random erasing, so that more robust global features are extracted. Finally, the features of different scales undergo deep weighted fusion, and the similarity measurement between features is improved by the re-ranking method.
Claims (5)
1. A pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method, characterized by comprising the following steps:
step 1, preprocessing an original pedestrian image by random erasing to obtain a pedestrian image, performing baseline network optimization on a Resnet-50 network model, and inputting the pedestrian image into the optimized Resnet-50 network model to obtain deep convolution features;
step 2, performing feature extraction with the original pedestrian image as the input image to obtain a salient human body image;
step 3, first adopting a pose convolver to extract the pose from the salient human body image to obtain a body part image, then inputting the body part image into a ResNet-50 network to extract local semantic features;
step 4, performing weighted fusion of the deep convolution features and the local semantic features to obtain weighted fusion features, measuring the distances between the fusion features of the images in the image test library and the image query library, and generating an initial measurement list from the distance measurement results;
and step 5, reordering the images in the initial measurement list with a re-ranking algorithm to obtain the correct image match ranking, and outputting matching pedestrian images to identify a specific pedestrian.
2. The pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method according to claim 1, characterized in that the specific way of performing baseline network optimization on the Resnet-50 network model is:
the loss function of the Resnet-50 network model is optimized by combining the Softmax loss and the triplet loss; the optimized loss function is L = Σ_{j=1}^{m} L_j, where m is the number of loss functions combined, with the triplet term:
L_Triplet = [ ‖f(x^a) − f(x^p)‖² − ‖f(x^a) − f(x^n)‖² + α ]_+ ;
in the above formulas, f(x^a) is the feature vector of the anchor sample, f(x^p) is the feature vector of the positive sample, f(x^n) is the feature vector of the negative sample, α is the margin by which the distance between f(x^a) and f(x^n) must exceed the distance between f(x^a) and f(x^p), and [·]_+ means that when the bracketed value is greater than zero it is taken as the loss value, and when it is less than zero the loss is zero.
3. The pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method according to claim 1, wherein the step 2 specifically comprises the following steps:
step 2.1, removing the last pooling stage of the VGG-16 network structure to obtain the network structure, inputting the original pedestrian image into the network structure as the input image, and outputting a feature map;
step 2.2, deconvolving the feature map to the size of the input image and adding a new convolution layer to generate a predicted saliency map;
step 2.3, first applying a convolution layer with kernel size 1 × 1 to the conv1-2 layer of the network structure to generate a boundary prediction, then adding the boundary prediction to the predicted saliency map to obtain a refined boundary, and then applying one more convolution layer to convolve the refined result to obtain the salient human body image.
4. The pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method according to claim 1, wherein the step 3 specifically comprises the following steps:
step 3.1, taking the salient human body image as the input of a pose estimator and locating 14 joint points;
step 3.2, grouping the 14 located human body joints into 6 sub-regions, cropping, rotating, and resizing the 6 sub-regions to a fixed size and orientation, and combining them into a stitched body part image;
step 3.3, performing a pose transformation on the size of each body part in the stitched body part image to obtain the body part image;
step 3.4, inputting the body part image into a ResNet-50 network for training and extracting local semantic features.
5. The pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method according to claim 1, characterized in that the specific process of the step 5 is:
for a pedestrian test image p and an image set G = {g_i}, the k-reciprocal nearest neighbors are encoded by weighting into a single vector to form the k-reciprocal feature; the Jaccard distances between the pedestrian test image p and the image set are then computed from the images' k-reciprocal features; finally, the original distances between the pedestrian test image p and the image set are weighted together with the Jaccard distances to obtain the distance formula; the distances between the images in the initial measurement list and the fusion features are computed with this distance formula, the list is reordered to obtain the correct image match ranking, and matching pedestrian images are output to identify the specific pedestrian.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110667913.9A CN113378729A (en) | 2021-06-16 | 2021-06-16 | Pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113378729A true CN113378729A (en) | 2021-09-10 |
Family
ID=77572789
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110667913.9A Pending CN113378729A (en) | 2021-06-16 | 2021-06-16 | Pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113378729A (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105787439A (en) * | 2016-02-04 | 2016-07-20 | 广州新节奏智能科技有限公司 | Depth image human body joint positioning method based on convolution nerve network |
CN109359684A (en) * | 2018-10-17 | 2019-02-19 | 苏州大学 | Fine granularity model recognizing method based on Weakly supervised positioning and subclass similarity measurement |
CN111401113A (en) * | 2019-01-02 | 2020-07-10 | 南京大学 | Pedestrian re-identification method based on human body posture estimation |
CN109740541A (en) * | 2019-01-04 | 2019-05-10 | 重庆大学 | A kind of pedestrian weight identifying system and method |
CN110163110A (en) * | 2019-04-23 | 2019-08-23 | 中电科大数据研究院有限公司 | A kind of pedestrian's recognition methods again merged based on transfer learning and depth characteristic |
CN110717411A (en) * | 2019-09-23 | 2020-01-21 | 湖北工业大学 | Pedestrian re-identification method based on deep layer feature fusion |
CN111709311A (en) * | 2020-05-27 | 2020-09-25 | 西安理工大学 | Pedestrian re-identification method based on multi-scale convolution feature fusion |
CN111860147A (en) * | 2020-06-11 | 2020-10-30 | 北京市威富安防科技有限公司 | Pedestrian re-identification model optimization processing method and device and computer equipment |
CN111783736A (en) * | 2020-07-23 | 2020-10-16 | 上海高重信息科技有限公司 | Pedestrian re-identification method, device and system based on human body semantic alignment |
Non-Patent Citations (1)
Title |
---|
ZHENG Ye; ZHAO Jieyu; WANG Chong; ZHANG Yi: "基于姿态引导对齐网络的局部行人再识别" (Partial pedestrian re-identification based on a pose-guided alignment network), 计算机工程 (Computer Engineering), no. 05, 15 May 2020 (2020-05-15) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108960140B (en) | Pedestrian re-identification method based on multi-region feature extraction and fusion | |
Wan et al. | DA-RoadNet: A dual-attention network for road extraction from high resolution satellite imagery | |
Yin et al. | Hot region selection based on selective search and modified fuzzy C-means in remote sensing images | |
WO2019232894A1 (en) | Complex scene-based human body key point detection system and method | |
CN109949340A (en) | Target scale adaptive tracking method based on OpenCV | |
CN111046856B (en) | Parallel pose tracking and map creating method based on dynamic and static feature extraction | |
CN112101150A (en) | Multi-feature fusion pedestrian re-identification method based on orientation constraint | |
CN111046732B (en) | Pedestrian re-recognition method based on multi-granularity semantic analysis and storage medium | |
Cao et al. | A coarse-to-fine weakly supervised learning method for green plastic cover segmentation using high-resolution remote sensing images | |
CN111709311A (en) | Pedestrian re-identification method based on multi-scale convolution feature fusion | |
CN110008913A (en) | The pedestrian's recognition methods again merged based on Attitude estimation with viewpoint mechanism | |
CN112699834B (en) | Traffic identification detection method, device, computer equipment and storage medium | |
CN112395977A (en) | Mammal posture recognition method based on body contour and leg joint skeleton | |
CN114596500A (en) | Remote sensing image semantic segmentation method based on channel-space attention and DeeplabV3plus | |
CN105574545B (en) | The semantic cutting method of street environment image various visual angles and device | |
Gong et al. | A two-level framework for place recognition with 3D LiDAR based on spatial relation graph | |
CN115527269A (en) | Intelligent human body posture image identification method and system | |
CN111709317A (en) | Pedestrian re-identification method based on multi-scale features under saliency model | |
Guo et al. | Image classification based on SURF and KNN | |
Pang et al. | Analysis of computer vision applied in martial arts | |
Shanmugavadivu et al. | FOSIR: fuzzy-object-shape for image retrieval applications | |
CN105825215A (en) | Instrument positioning method based on local neighbor embedded kernel function and carrier of method | |
Zhang | Sports action recognition based on particle swarm optimization neural networks | |
CN111862147A (en) | Method for tracking multiple vehicles and multiple human targets in video | |
Zhou et al. | Place recognition and navigation of outdoor mobile robots based on random Forest learning with a 3D LiDAR |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||