CN111177447B - Pedestrian image identification method based on depth network model - Google Patents


Info

Publication number: CN111177447B (granted publication of application CN201911362901.4A)
Authority: CN (China)
Prior art keywords: pedestrian, feature, features, pedestrian image, image
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN111177447A
Inventors: 杨育彬, 林喜鹏
Current and original assignee: Nanjing University
Events: application filed by Nanjing University; publication of CN111177447A; application granted; publication of CN111177447B; anticipated expiration tracked

Classifications

    • G06F16/583: Information retrieval of still image data; retrieval characterised by metadata automatically derived from the content
    • G06F16/55: Information retrieval of still image data; clustering; classification
    • G06F18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/047: Neural networks; probabilistic or stochastic networks
    • G06N3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
    • G06V40/23: Recognition of whole body movements, e.g. for sport training

Abstract

The invention provides a pedestrian image identification method based on a depth network model, which comprises the following steps: preprocessing the pedestrian images; running an adaptive sampling algorithm on the preprocessed data to obtain batches containing more hard samples; extracting multilayer features through a backbone network model, where sub-modules enhance the low-level features, which are then downscaled and spliced with the high-level features to obtain the multilayer features; segmenting the multilayer features at different granularities to form a multi-branch structure, extracting the component features and global features of each branch, and splicing all extracted features into the depth representation of the pedestrian image; training the constructed network model; and extracting the depth representation of each query image with the trained network model and returning the identification result of each query image according to its cosine-distance similarity to the queried set. This multilayer, multi-granularity pedestrian re-identification depth model achieves the best pedestrian re-identification performance at the present stage.

Description

Pedestrian image identification method based on depth network model
Technical Field
The invention relates to the field of machine learning and computer vision, in particular to a pedestrian image identification method based on a depth network model.
Background
With the development of modern society, public safety has gradually received people's attention. Large numbers of surveillance cameras are installed in crowded places prone to public safety incidents, such as shopping malls, apartments, schools, hospitals, office buildings and large squares, and research on surveillance video concentrates on identifying visible objects, especially pedestrians, since pedestrians are usually the targets of a monitoring system. More specifically, the task of the surveillance system is to search for a specific pedestrian in the surveillance video data, i.e. the pedestrian re-identification task.
However, the volume of surveillance video data is often enormous, and finding a specific pedestrian in such massive data is very challenging owing to factors such as the lighting of the environment, occlusions, the pedestrian's clothing, the shooting angle and the camera itself. Monitoring by manual identification is not only costly but also inefficient and unstable, so relying on manual identification alone for pedestrian re-identification is unrealistic in the long run. Quickly analyzing the surveillance video data of public safety places and automatically finding specific pedestrians can therefore markedly improve monitoring quality, and is of great significance for city construction and the guarantee of social safety.
Among existing pedestrian re-identification methods, part-based depth models achieve the most advanced performance. However, current part-based depth models usually obtain component features by segmenting the high-level features of the backbone network. On one hand, the high-level features of a depth model are highly coupled, and simply segmenting them causes a loss of semantic information, which limits model performance. On the other hand, although the semantic information of the low-level features of a depth model is weak, the low-level features are weakly coupled and more robust to segmentation, so combining the high-level and low-level features can alleviate the semantic information loss caused by segmentation.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to overcome the technical shortcomings of the prior art by providing a pedestrian image recognition method based on a depth network model, addressing the semantic information loss of existing part-based depth models for pedestrian re-identification. The invention comprises the following steps:
step 1, preprocessing pedestrian images in a pedestrian image data set, wherein the pedestrian image data set comprises a training set and a testing set, the testing set comprises a query set and a queried set, the pedestrian identities in the testing set and the training set are not repeated, and the query set and the queried set have the same pedestrian identity;
step 2, dynamically sampling the preprocessed training set;
step 3, constructing a network model for pedestrian re-identification;
step 4, training the network model constructed in the step 3;
and 5, re-identifying the pedestrian.
The step 1 comprises the following steps:
step 1-1, adjusting the size of the input pedestrian image by bicubic interpolation: for any channel of pedestrian images of different sizes, resize the image to 3K×K, where K is generally 128 or 192. For any point P(0,0) in the image, define the relative coordinates of the 16 surrounding points, including the point itself, as P(r,c), with -1 ≤ r ≤ 2 and -1 ≤ c ≤ 2; r and c denote the offsets of the abscissa and ordinate respectively, a negative value meaning a leftward or upward offset and a positive value a rightward or downward offset, e.g. P(0,1) is the point adjacent to the right of P(0,0);
where P(0,0) denotes the point in the original image closest to the mapping of pixel (x1, y1) of the target interpolation image; let (u, v) be the coordinate offset of (x1, y1) from P(0,0), and (i, j) the absolute coordinates of P(0,0) in the original image. Bicubic interpolation is then the sum of the convolution interpolations of the 16 points, i.e. the interpolation function F(i+u, j+v):
F(i+u, j+v) = Σ_{r=-1}^{2} Σ_{c=-1}^{2} f(i+r, j+c) · s(r-u) · s(c-v)
where x1 = i+u, y1 = j+v, f(i+r, j+c) is the pixel value of the corresponding one of the 16 points in the original image, and s(x) is the sampling kernel:
s(x) = (a+2)|x|^3 - (a+3)|x|^2 + 1, for |x| ≤ 1;
s(x) = a|x|^3 - 5a|x|^2 + 8a|x| - 4a, for 1 < |x| < 2;
s(x) = 0, otherwise;
where a is the formula coefficient, commonly taken as -0.5;
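As an illustration, the kernel and the 16-point summation above translate directly into Python; the following is a minimal NumPy sketch under our own naming (the patent gives only the formulas, not code), with a = -0.5 as the common coefficient:

```python
import numpy as np

def s(x, a=-0.5):
    # Bicubic sampling kernel; a is the formula coefficient (commonly -0.5).
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def bicubic_at(img, i, j, u, v):
    # Interpolate one channel at (i + u, j + v): weighted sum of the
    # 16 neighbours P(r, c), -1 <= r, c <= 2, as in F(i+u, j+v) above.
    h, w = img.shape
    val = 0.0
    for r in range(-1, 3):
        for c in range(-1, 3):
            # Clamp to the image border so the sketch stays runnable at edges.
            pr = min(max(i + r, 0), h - 1)
            pc = min(max(j + c, 0), w - 1)
            val += img[pr, pc] * s(r - u) * s(c - v)
    return val
```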
step 1-2, randomly horizontally flipping the pedestrian image: any channel of a 3K×K pedestrian image is flipped horizontally with probability P1, 0 < P1 < 1. For a second arbitrary point (x2, y2) on the pedestrian image, the coordinates (xf, yf) of its mirror point after the horizontal flip are:
(xf, yf) = (x2, K - y2 - 1)
where (x2, y2) are the coordinates of the second arbitrary point in the pedestrian image, 0 ≤ x2 ≤ 3K, 0 ≤ y2 ≤ K;
Step 1-3, randomly erasing the pedestrian image: for any channel of a 3K×K pedestrian image, with probability P2, 0 < P2 < 1, erase a random region of size h×w according to the random erasing function f() below, setting all pixel values of each channel inside the region to that channel's mean pixel value:
f(x3:x3+h, y3:y3+w) = m,
where (x3, y3) are the coordinates of a third arbitrary point in the pedestrian image, 0 ≤ x3 ≤ 3K, 0 ≤ y3 ≤ K, and m is the mean pixel value of each channel of the pedestrian image;
step 1-4, standardizing the data of each channel of the pedestrian image: normalize and standardize any channel of the 3K×K pedestrian image according to the following normalization function f(x):
f(x) = (x/255 - μ) / δ
where x is the pixel value of any point of each channel of the pedestrian image obtained in step 1-3, 0 ≤ x ≤ 255, μ is the channel mean on the public data set ImageNet, and δ is the channel standard deviation on ImageNet.
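Taken together, steps 1-1 to 1-4 amount to a standard augmentation pipeline. A hedged torchvision-style sketch follows; K, the 0.5 flip/erase probabilities and the ImageNet statistics come from the text and the embodiment below, while the ordering (erasing after normalization, with a constant fill approximating the channel mean) is an adaptation for torchvision tensors, not the patent's code:

```python
import torchvision.transforms as T

K = 128  # image width; height is 3K as in step 1-1
preprocess = T.Compose([
    T.Resize((3 * K, K), interpolation=T.InterpolationMode.BICUBIC),  # step 1-1
    T.RandomHorizontalFlip(p=0.5),                                    # step 1-2, P1 = 0.5
    T.ToTensor(),                                                     # scales pixels to [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],                           # step 1-4, ImageNet stats
                std=[0.229, 0.224, 0.225]),
    # step 1-3, P2 = 0.5; the patent erases with the per-channel mean,
    # which is approximately 0 after normalization
    T.RandomErasing(p=0.5, value=0),
])
```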
The step 2 comprises the following steps:
step 2-1, counting the index list of the pedestrian images of each identity in the training set, where a pedestrian image in the training set is a training sample; define the dictionary of not-yet-sampled sample index lists as US, the set of samples the model classifies correctly as TS, and the set of samples the model misclassifies as FS; initialize TS and FS as empty and US as the dictionary formed by all current training samples;
step 2-2, performing dynamic sampling: in the current iteration round, draw from the training set a batch consisting of P pedestrians with Q images each, the P pedestrian identities being randomly sampled from the label list of the training set;
step 2-3, for each pedestrian identity obtained in step 2-2, preferentially sample Q images from the US set; if the US set is empty or fewer than Q images of that identity remain, complete the sample from the FS set, then from the TS set, and if the number is still insufficient, repeat step 2-3 with resampling until Q images are obtained;
step 2-4, after each iteration's sampling, move the samples drawn in the current round from the US set to the FS set; at the same time, move the samples the model now classifies correctly from the FS set to the TS set, and the samples the model misclassifies from the TS set to the FS set;
step 2-5, loop steps 2-3 and 2-4 until a batch of size P×Q has been sampled; a minimal sketch of this bookkeeping follows below.
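A minimal Python sketch of the US/FS/TS bookkeeping, assuming per-identity index lists and an external signal telling the sampler which samples the model classified correctly (all names are illustrative, not the patent's code):

```python
import random
from collections import defaultdict

class DynamicSampler:
    def __init__(self, labels):
        # labels[i] = identity of training image i (step 2-1).
        self.by_id = defaultdict(list)
        for idx, pid in enumerate(labels):
            self.by_id[pid].append(idx)
        self.US = {pid: set(v) for pid, v in self.by_id.items()}  # not yet sampled
        self.FS = defaultdict(set)  # misclassified (hard) samples
        self.TS = defaultdict(set)  # correctly classified samples

    def sample_batch(self, P, Q):
        pids = random.sample(list(self.by_id), P)      # step 2-2
        batch = []
        for pid in pids:                               # step 2-3: US first, then FS, then TS
            picked = []
            for pool in (self.US[pid], self.FS[pid], self.TS[pid]):
                while pool and len(picked) < Q:
                    picked.append(pool.pop())
            while len(picked) < Q:                     # still short: resample with repetition
                picked.append(random.choice(self.by_id[pid]))
            self.FS[pid].update(picked)                # step 2-4: drawn samples leave US,
            batch.extend(picked)                       # parked in FS until feedback arrives
        return batch

    def feedback(self, pid, idx, correct):
        # Step 2-4: move samples between FS and TS according to the model's prediction.
        (self.TS if correct else self.FS)[pid].add(idx)
        (self.FS if correct else self.TS)[pid].discard(idx)
```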
the step 3 comprises the following steps:
step 3-1, constructing a network model for pedestrian re-identification, the network model comprising a backbone network model and sub-modules;
the backbone extracts multilayer features, i.e. features of different depths: the first-layer depth feature l1, the second-layer depth feature l2, the third-layer depth feature l3 and the fourth-layer depth feature l4. The backbone network model is ResNet, a classical classification network for the ImageNet data set;
the sub-modules comprise an enhancement module, a downscaling module, a reduction module and a max-pooling module. The first-layer depth feature l1 and second-layer depth feature l2 are defined as low-level features, while the third-layer depth feature l3 and fourth-layer depth feature l4 are high-level features;
when the first-layer depth feature l1 has size C×H×W, then by the backbone network model the second-layer depth feature l2 has size 2C×H/2×W/2 and the third-layer depth feature l3 has size 4C×H/4×W/4, where C, H and W are the channel count, height and width of l1;
step 3-2, two enhancement modules respectively enhance l1 and l2 while keeping their sizes unchanged; two downscaling modules then reduce l1 and l2 each to size 2C×H/4×W/4;
step 3-3, the reduction module halves the channel count of the third-layer depth feature l3, i.e. reduces it to size 2C×H/4×W/4;
the downscaled first-layer depth feature l1 and the reduced third-layer depth feature l3 are spliced along the channel dimension to obtain the first multilayer depth feature l13 of size 4C×H/4×W/4, matching the size of l3 before reduction;
the downscaled second-layer depth feature l2 and the reduced third-layer depth feature l3 are spliced along the channel dimension to obtain the second multilayer depth feature l23, likewise of size 4C×H/4×W/4;
step 3-4, the multilayer depth features l13 and l23 obtained in step 3-3 and the third-layer depth feature l3 of the backbone network model are each fed into a copy of the network layer corresponding to the backbone's fourth-layer depth feature l4, forming the multi-branch structure. The global features comprise: the first global feature l4-1, the second global feature l4-2 and the third global feature l4-3. The first global feature l4-1 is obtained by feeding l3 into the layer corresponding to l4 and is equivalent to the backbone's fourth-layer depth feature; the second global feature l4-2 is obtained by feeding l23 into that layer; the third global feature l4-3 is obtained by feeding l13 into that layer;
the global features are segmented into component features: the first global feature l4-1 is cut into first component features at granularity 1, the second global feature l4-2 into second component features at granularity 2, and the third global feature l4-3 into third component features at granularity 3;
a max-pooling module pools the resolution of the global features and component features to 1×1, and the reduction module, a 1×1 convolution layer with a shared kernel, further reduces the channel count of the global and component features to F (F is generally 256, as in the embodiment below). Each reduced global or component feature thus has size F×1×1, and the set of reduced component features is denoted S;
all reduced global features and component features are spliced to obtain the constructed depth representation of the pedestrian image, of size M×F, where M is the total number of global and component features. A sketch of the branch construction follows below.
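To make the fusion concrete, here is a hedged PyTorch sketch of steps 3-1 to 3-4 using the torchvision ResNet-50 layer names; the internals of the enhancement, downscaling and reduction modules are simplified to single convolutions (fig. 3 gives the actual Conv-BatchNorm-ReLU blocks), so this is a sketch under stated assumptions, not the patent's implementation:

```python
import copy
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultiLayerBranches(nn.Module):
    def __init__(self, C=256):
        super().__init__()
        r = resnet50()
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2, self.layer3 = r.layer1, r.layer2, r.layer3
        # enhancement keeps sizes; downscaling maps l1/l2 to 2C x H/4 x W/4;
        # reduction halves l3's channels (steps 3-2 / 3-3) -- simplified convs
        self.enh1 = nn.Conv2d(C, C, 3, padding=1)
        self.enh2 = nn.Conv2d(2 * C, 2 * C, 3, padding=1)
        self.down1 = nn.Conv2d(C, 2 * C, 1, stride=4)
        self.down2 = nn.Conv2d(2 * C, 2 * C, 1, stride=2)
        self.reduce3 = nn.Conv2d(4 * C, 2 * C, 1)
        # three copies of layer4, one per branch (step 3-4)
        self.branch = nn.ModuleList([copy.deepcopy(r.layer4) for _ in range(3)])

    def forward(self, x):
        l1 = self.layer1(self.stem(x))          # C x H x W
        l2 = self.layer2(l1)                    # 2C x H/2 x W/2
        l3 = self.layer3(l2)                    # 4C x H/4 x W/4
        r3 = self.reduce3(l3)                   # 2C x H/4 x W/4
        l13 = torch.cat([self.down1(self.enh1(l1)), r3], dim=1)  # 4C x H/4 x W/4
        l23 = torch.cat([self.down2(self.enh2(l2)), r3], dim=1)  # 4C x H/4 x W/4
        return [b(f) for b, f in zip(self.branch, (l3, l23, l13))]  # three global maps
```

With a 3K×K input (K = 128), l1 is 256×96×32, matching the H = 96, W = 32 of the embodiment, and each concatenated feature has the 1024 channels the layer4 copies expect.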
Step 4 comprises the following steps:
step 4-1, defining the experiment configuration: before training the network model on the training set, first define the model optimizer used to update parameters; set the batch size of the dynamic sampling of step 2 to P×Q, where P is the number of pedestrian identities per batch and Q the number of pedestrian images per identity; finally, set a learning-rate scheduler. The training set carries pedestrian identity labels, and the number of identity-label classes of the training set is denoted Y;
step 4-2, optimizing each global feature of step 3: each global feature is optimized with an improved triplet (ternary) loss for the feature metric, averaged over the global features. With hard-example mining within each batch, the loss L_triplet is:
L_triplet = (1/G) Σ_{g=1}^{G} Σ_{i=1}^{P} Σ_{a=1}^{Q} [ α + max_{p=1,…,Q} || f_a^{g,i} - f_p^{g,i} ||_2 - min_{j=1,…,P, j≠i; n=1,…,Q} || f_a^{g,i} - f_n^{g,j} ||_2 ]_+
where G denotes the number of global features (G = 3), f_a^{g,i} is the anchor sample of the g-th global feature of the i-th pedestrian identity, f_p^{g,i} a positive sample of the g-th global feature of the i-th pedestrian identity, and f_n^{g,j} a negative sample of the g-th global feature of the j-th pedestrian identity; α is a hyper-parameter controlling the gap between the inter-class and intra-class distances, 1.0 < α < 1.5, 1 ≤ i ≤ P, 1 ≤ a ≤ Q. A sketch of this loss follows below.
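A hedged PyTorch sketch of this loss for one global feature, under our batch-hard reading of the formula above (P identities × Q images per batch, Euclidean distances; names are illustrative):

```python
import torch

def hard_triplet_loss(feats, labels, alpha=1.2):
    # feats: (P*Q, d) global features of one branch; labels: (P*Q,) identity ids.
    dist = torch.cdist(feats, feats)                 # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = (dist * same).max(dim=1).values    # furthest same-identity sample
    inf = torch.full_like(dist, float("inf"))
    hardest_neg = torch.where(same, inf, dist).min(dim=1).values  # closest other identity
    # hinge with margin alpha, averaged over all anchors in the batch
    return torch.clamp(alpha + hardest_pos - hardest_neg, min=0).mean()
```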
step 4-3, optimizing each reduced component feature obtained in step 3-4 with an identity-classification cross-entropy loss. Each component feature uses its own linear classifier without bias term, component features and linear classifiers corresponding one to one. The identity-classification cross-entropy loss L_id is:
L_id = -(1/(N·P·Q)) Σ_{j=1}^{N} Σ_{q=1}^{P×Q} Σ_{r=1}^{Y} 1_{r=y} · log( softmax( fc_j(f_jq) )_r )
where fc_j denotes the j-th linear classifier, f_jq the vector of the j-th component feature f_j for the q-th pedestrian image of a batch, 1 ≤ j ≤ N, 1 ≤ q ≤ P×Q; N is the total number of linear classifiers, i.e. the number of component features; and 1_{r=y} is the one-hot coded vector whose length is the number of pedestrian identities and whose hot index r equals the ground-truth identity y of the pedestrian image;
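Correspondingly, a sketch of L_id with one bias-free linear classifier per component feature (names are illustrative; num_ids corresponds to Y):

```python
import torch.nn as nn
import torch.nn.functional as F

class PartClassifiers(nn.Module):
    def __init__(self, n_parts, feat_dim, num_ids):
        super().__init__()
        # one bias-free linear classifier per component feature (step 4-3)
        self.fc = nn.ModuleList([nn.Linear(feat_dim, num_ids, bias=False)
                                 for _ in range(n_parts)])

    def forward(self, part_feats, y):
        # part_feats: list of N tensors of shape (batch, feat_dim); y: (batch,) ids
        losses = [F.cross_entropy(fc(f), y) for fc, f in zip(self.fc, part_feats)]
        return sum(losses) / len(losses)   # average over the N component features
```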
step 4-4, adding the cross entropy loss function and the improved ternary loss function to obtain a loss function L used in final training, which is as follows:
L=Ltriplet+Lid
and 4-5, performing model training of the network model on the training set.
In steps 4-5, model training of the network model on the training set takes as input: the training set D; the pedestrian identity labels y; the iteration count T; a sampler S, an optimizer OPT and a learning-rate scheduler LR; the initialization parameters θ_0, where the subscript denotes the current iteration number, and the initial model Φ(x; θ_0). The output is the model Φ(x; θ_T). The specific training process comprises the following steps:
Step 4-5-1, load the model θ_0 pre-trained on the public data set ImageNet;
Step 4-5-2, the sampler S dynamically samples from the training set D, according to the configuration of step 4-1, N_b preprocessed pedestrian images {x_i}_{i=1}^{N_b}, where x_i denotes the i-th preprocessed pedestrian image and N_b = P×Q;
4-5-3, clearing the accumulated gradient by an optimizer OPT;
step 4-5-4, extract the global features and component features:
f_i = Φ(x_i; θ_t), i = 1, …, N_b
step 4-5-5, obtain the loss value with the loss function of step 4-4:
loss = L_triplet + L_id, computed on the extracted features and the labels y;
4-5-6, performing back propagation according to the loss value loss;
step 4-5-7, the optimizer OPT updates the model parameters θ_t, while the learning-rate scheduler LR updates the learning rate;
and 4-5-8, circularly and iteratively executing the steps 4-5-2 to 4-5-7 until the iteration number reaches T.
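The loop of steps 4-5-1 to 4-5-8 maps onto a conventional PyTorch training loop. A hedged sketch follows, reusing the sampler and triplet loss sketched earlier; the learning rate, the milestone list and the model API (returning global and component features separately, with part_heads computing L_id) are assumptions, while Adam with AMSGrad, P = 12, Q = 4 and the gamma = 0.1 schedule follow the embodiment detailed below:

```python
import torch

def train(model, sampler, dataset, labels, T, P=12, Q=4, device="cuda"):
    opt = torch.optim.Adam(model.parameters(), lr=3e-4, amsgrad=True)   # step 4-1 (lr assumed)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[40, 80, 120], gamma=0.1)
    model.train()
    for t in range(T):
        idxs = sampler.sample_batch(P, Q)                # step 4-5-2: N_b = P*Q images
        x = torch.stack([dataset[i] for i in idxs]).to(device)
        y = torch.tensor([labels[i] for i in idxs], device=device)
        opt.zero_grad()                                  # step 4-5-3: clear gradients
        globals_, parts = model(x)                       # step 4-5-4 (assumed model API)
        loss = sum(hard_triplet_loss(g, y) for g in globals_) / len(globals_) \
             + model.part_heads(parts, y)                # step 4-5-5: L_triplet + L_id
        loss.backward()                                  # step 4-5-6: back propagation
        opt.step()                                       # step 4-5-7: update parameters
        sched.step()                                     # ... and the learning rate
    return model
```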
The step 5 comprises the following steps:
step 5-1, load the network model trained in step 4 and use it to extract the pedestrian images of the test set, i.e. extract the depth representations of the query images in the query set and of the queried images in the queried set;
as defined in step 3-4, all global features and component features are spliced together; each feature of the test set is expressed as:
f(x) = Φ(x; θ_T), x ∈ N_test
where N_test denotes the test set and θ_T the parameter set at iteration count T;
the finally extracted depth representation of a pedestrian image is the splice of all its reduced global and component features, of size M×F.
step 5-2, to eliminate the deviation between the training set and the test set of the pedestrian image data set, the depth representation f(x) of the pedestrian image and the depth representation f(flip(x)) of the horizontally flipped pedestrian image are added, giving the test-set depth representation of the pedestrian image:
F(x) = f(x) + f(flip(x))
Step 5-3, normalize the depth representation F(x) of the pedestrian image obtained in step 5-2 by its two-norm, computed as:
||F||_2 = ( Σ_k F_k^2 )^{1/2}
The two-norm-normalized depth representation of the final test set is:
F̂(x) = F(x) / ||F(x)||_2
step 5-4, according to the depth characterization of the pedestrian images of the final test set, calculating the distance between each pedestrian image in the query set and each pedestrian image in the queried set, obtaining the query result of each pedestrian image in the query set, and realizing pedestrian re-identification;
step 5-4 comprises: denote the depth representation of each pedestrian image in the query set by q_i and the depth representation of each pedestrian image in the queried set by g_j; the distance matrix between the query set and the queried set is then:
M_ij = 1 - q_i · g_j, 1 ≤ i ≤ |N_query|, 1 ≤ j ≤ |N_gallery|
where N_gallery denotes the queried set, N_query the query set, and M_ij is the element in the i-th row and j-th column of the matrix, the cosine distance of the two-norm-normalized representations. Sorting, in ascending order, the distances between each query image and every pedestrian image in the queried set yields the identification result of each query image. A sketch of this retrieval procedure follows below.
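A sketch of the whole retrieval procedure of steps 5-1 to 5-4 (flip addition, two-norm normalization, cosine-distance ranking); the model is assumed to return the flattened M×F depth representation, and all names are illustrative:

```python
import torch

@torch.no_grad()
def retrieve(model, query, gallery):
    # query/gallery: (N, 3, 3K, K) preprocessed image tensors
    def embed(x):
        f = model(x)                                   # (N, M*F) depth representation
        f = f + model(torch.flip(x, dims=[3]))         # step 5-2: add flipped representation
        return f / f.norm(dim=1, keepdim=True)         # step 5-3: two-norm normalization
    q, g = embed(query), embed(gallery)
    dist = 1.0 - q @ g.t()                             # step 5-4: cosine-distance matrix
    return dist.argsort(dim=1)                         # ascending: best matches first
```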
Advantageous effects:
In the prior art, part-based depth models suffer semantic information loss because of the high coupling of the high-level features. With the method of the invention, a multilayer, multi-granularity depth model suppresses the semantic information loss of the high-level features, improving the pedestrian re-identification performance of part-based depth models; building the pedestrian depth representation on top of the data preprocessing, training the model and finally completing pedestrian re-identification achieves the best pedestrian re-identification performance at the present stage.
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic flow chart of a multi-layer multi-granularity pedestrian re-identification depth model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a multi-layer multi-granularity pedestrian re-identification depth model according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a convolutional network structure of an enhancement module, a downscaling module and a reduction module in a multilayer multi-granularity pedestrian re-identification depth model according to an embodiment of the present invention, where a maximum pooling layer module is a basic network pooling layer;
fig. 4 is a diagram of an example of query results in a multi-layer multi-granularity pedestrian re-identification depth model according to an embodiment of the present invention.
Fig. 5 is a schematic diagram defining the relative coordinates of the 16 points, including itself, surrounding an arbitrary point P(0,0).
Detailed Description
The embodiment of the invention provides a pedestrian image identification method based on a deep network model, which is applied to rapidly analyzing monitoring video data of public security places, automatically finding out specific pedestrians, remarkably improving monitoring quality and having important significance on city construction and social security.
As shown in fig. 1, a schematic workflow diagram of a pedestrian image recognition method based on a depth network model provided in the embodiment of the present invention is provided, and the embodiment discloses a pedestrian image recognition method based on a depth network model, including:
step 1, preprocessing pedestrian images in a pedestrian image data set, comprising: in the step, the pedestrian image data set comprises a training set and a test set, the test set comprises a query set and a queried set, and specifically, the pedestrian image data set used in the invention is a pedestrian image data set with any public standard, such as Market-1501, DukeMTMC-reiD and the like. The data enhancement comprises random horizontal turning and random erasing, and the pedestrian image can be obtained through manual labeling or a pedestrian detection algorithm. In this embodiment, through carrying out data preprocessing to the pedestrian image in the pedestrian image data set, the variety of sample can be effectively improved.
Step 2, after data preprocessing is completed, dynamic sampling is needed to be carried out, sample batches for training are obtained, the index list corresponding to the pedestrian images of each identity in a training set is counted, a dictionary set of the sample index lists which are not sampled is defined as US, a set of correct classification of a model is defined as TS, a set of wrong classification of the model is defined as FS, TS and FS are initialized to be empty, US is a dictionary set formed by all current training samples, and then a dynamic sampling algorithm is executed;
step 3, constructing a network model for pedestrian re-identification, namely constructing a depth representation of the pedestrian image, and comprising the following steps: extracting multilayer features through a backbone network model, using sub-modules to enhance and reduce the scale of low-level features, combining the high-level features and the low-level features to obtain the multilayer features, forming a multi-branch structure, and extracting the component features and the global features of each branch. In this step, the global features of the branches are used for representing corresponding pedestrian images, and the sub-modules include a lateral enhancement module, a downscaling module, a reduction module and a maximum pooling layer module. The schematic diagram of the network structure in the multilayer multi-granularity pedestrian re-identification depth model provided by the embodiment of the invention is shown in fig. 2. In fig. 2, an arrow with a reference number 0 indicates each layer of the backbone network, an arrow with a reference number 1 indicates an enhancement module, an arrow with a reference number 2 indicates a downscaling module, an arrow with a reference number 3 indicates a maximum pooling layer module, and an arrow with a reference number 4 indicates a reduction module.
Step 4, training the network model constructed in step 3, including: defining experiment related configuration, and optimizing model parameters of the network model, specifically, in this embodiment, optimizing the model parameters by combining a cross entropy loss function of identity classification and an improved ternary loss function for feature measurement. The loss function used in the final training is the sum of the average cross entropy loss function for each component and the average modified ternary loss function for each global feature.
And 5, re-identifying the pedestrian, comprising the following steps: under the condition that the identity of the pedestrians in the test set and the identity of the pedestrians in the training set are not repeated, extracting the depth representation of the query image through the network model trained in the step 4, normalizing the depth representation of the query image by using a two-norm method, and returning the identification result of each query image according to the similarity of each query image and the queried set based on the cosine distance. In the step, the pedestrian is re-identified under the condition that the pedestrian identity is not repeated, and the effectiveness of the model can be verified through the returned identification result.
In the modern society, the monitoring video data of public safety places are quickly analyzed, specific pedestrians are automatically found, the monitoring quality can be obviously improved, and the method has important significance for city construction and social safety. The invention provides a multilayer and multi-granularity pedestrian re-identification depth model and realizes the best pedestrian re-identification performance at the present stage.
In the following, the steps of the present invention are described in detail, and in the multi-layer and multi-granularity pedestrian re-identification depth model according to this embodiment, the step 1 includes:
step 1-1, adjust the size of the input pedestrian image by bicubic interpolation: resize any channel of pedestrian images of different sizes to 3K×K and, for any point P(0,0), define the relative coordinates of the 16 surrounding points including itself as P(r,c), -1 ≤ r ≤ 2, -1 ≤ c ≤ 2, as shown in fig. 5; r and c denote the offsets of the abscissa and ordinate respectively, a negative value meaning a leftward or upward offset and a positive value a rightward or downward offset, e.g. P(0,1) is the point adjacent to the right of P(0,0);
where P(0,0) represents the point in the original image closest to the mapping of pixel (x1, y1) of the target interpolation image, (u, v) is the coordinate offset of (x1, y1) from P(0,0), and (i, j) the absolute coordinate of P(0,0) in the original image; bicubic interpolation is the sum of the convolution interpolations of the 16 points, i.e. the interpolation function:
F(i+u, j+v) = Σ_{r=-1}^{2} Σ_{c=-1}^{2} f(i+r, j+c) · s(r-u) · s(c-v)
where x1 = i+u, y1 = j+v, f(i+r, j+c) is the pixel value of the corresponding point of the 16 points in the original image, and s(x) is the sampling kernel given in step 1-1 above, with formula coefficient a commonly taken as -0.5;
step 1-2, data enhancement by randomly horizontally flipping the pedestrian image, comprising: any channel of a 3K×K pedestrian image is flipped horizontally with probability P1, 0 < P1 < 1; in the actual experiments of this embodiment, P1 = 0.5. For a second arbitrary point (x2, y2) on the pedestrian image, the coordinates of its mirror point after the horizontal flip are:
(xf, yf) = (x2, K - y2 - 1)
where (x2, y2) are the coordinates of the second arbitrary point in the pedestrian image, 0 ≤ x2 ≤ 3K, 0 ≤ y2 ≤ K.
Step 1-3, data enhancement by randomly erasing the pedestrian image, including: for any channel of a 3K×K pedestrian image, with probability P2, 0 < P2 < 1 (P2 = 0.5 in the actual experiments of this embodiment), erase a random region of size h×w according to the random erasing function below, setting the pixel values of each channel inside the region to that channel's mean pixel value:
f(x3:x3+h, y3:y3+w) = m
where (x3, y3) are the coordinates of a third arbitrary point in the pedestrian image, 0 ≤ x3 ≤ 3K, 0 ≤ y3 ≤ K, and m is the mean pixel value of each channel of the pedestrian image.
Step 1-4, standardizing the data of each channel of the pedestrian image, comprising: normalize and standardize any channel of the 3K×K pedestrian image according to the normalization function of step 1-4 above:
f(x) = (x/255 - μ) / δ
where x is a pixel value of the pedestrian image obtained in step 1-3, 0 ≤ x ≤ 255, μ is the channel mean on the public data set ImageNet, and δ the channel standard deviation on ImageNet. This embodiment uses the per-channel statistics of the ImageNet data set: the RGB channel means are 0.485, 0.456, 0.406 and the standard deviations 0.229, 0.224, 0.225.
Step 2, after finishing data preprocessing, carrying out dynamic sampling to obtain sample batches for training;
step 2-1, counting an index list corresponding to a pedestrian image of each identity in a training set, defining a dictionary set of an index list of samples which are not sampled as US, a correctly classified set of models as TS, an incorrectly classified set of models as FS, initializing TS, FS as null, and US as a dictionary set formed by all current training samples;
step 2-2, performing dynamic sampling: in the current iteration round, draw from the training set a batch consisting of P pedestrians with Q images each, the P pedestrian identities being randomly sampled from the label list of the training set;
step 2-3, for each pedestrian identity obtained in step 2-2, preferentially sample Q images from the US set; if the US set is empty or fewer than Q images of that identity remain, complete the sample from the FS set, then from the TS set, and if the number is still insufficient, repeat this step with resampling until Q images are obtained;
step 2-4, after each iteration sampling, transferring the samples sampled in the current iteration round from the US set to the FS set, simultaneously transferring the samples correctly classified by the model from the FS set to the TS set, and transferring the samples wrongly classified by the model from the TS set to the FS set;
step 2-5, the step 2-3 and the step 2-4 are circulated until a batch with the size of P multiplied by Q is obtained by sampling;
after that, a depth characterization of the pedestrian image needs to be constructed through step 3, in the multilayer and multi-granularity pedestrian re-identification depth model according to the embodiment, the step 3 includes:
step 3-1, extracting multilayer features through a backbone network model. In this embodiment, the backbone network model is an existing basic deep convolutional neural network model, such as ResNet or VGG; features of different depths can be extracted through the backbone ResNet101 and comprise the first-layer depth feature l1, the second-layer depth feature l2, the third-layer depth feature l3 and the fourth-layer depth feature l4 (l4 is not shown in fig. 2). The sub-modules comprise an enhancement module, a downscaling module, a max-pooling module and a reduction module; specifically, in fig. 2, arrows labeled 0 indicate the layers of the backbone network, arrows labeled 1 the enhancement modules, arrows labeled 2 the downscaling modules, arrows labeled 3 the max-pooling modules, and arrows labeled 4 the reduction modules.
As shown in fig. 3, a schematic diagram of the convolutional network structure of the enhancement module, downscaling module and reduction module in the multilayer multi-granularity pedestrian re-identification depth model of the embodiment: Conv is a convolutional layer, the number after Conv is the convolution kernel size of that layer, BatchNorm2d is a batch normalization layer and ReLU a non-linear activation function layer; the max-pooling module is an existing common basic module and is not detailed here. A minimal sketch of this block pattern follows below.
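Following fig. 3's description, each of these sub-modules can be read as a Conv-BatchNorm2d-ReLU block; a minimal sketch (kernel sizes other than the stated 1×1 reduction are assumptions):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, stride=1):
    # Conv -> BatchNorm2d -> ReLU, the pattern shown in fig. 3
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, stride=stride, padding=kernel // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# e.g. a reduction module: 1x1 convolution halving l3's channels (4C -> 2C)
reduce3 = conv_block(1024, 512, kernel=1)
```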
In this embodiment, step 3-2 comprises: enhance the representation capability of the first-layer depth feature l1 and the second-layer depth feature l2 with the enhancement modules, and reduce l1 and l2 with two downscaling modules to the size of the reduced third-layer depth feature l3.
When the first-layer depth feature l1 has size C×H×W (in this embodiment W is generally K/4 and H is generally 3W), then by the backbone network model the second-layer depth feature l2 has size 2C×H/2×W/2 and the third-layer depth feature l3 has size 4C×H/4×W/4; the reduced l3 has size 2C×H/4×W/4, where C is the number of channels, H the height of l1 (96 in this example) and W the width of l1 (32 in this example).
Step 3-3, after the two downscaling modules, the first-layer depth feature l1 and second-layer depth feature l2 are reduced to the size of the reduced third-layer depth feature l3, i.e. 2C×H/4×W/4;
splicing the downscaled first-layer depth feature l1 and the reduced third-layer depth feature l3 along the channel dimension yields a depth feature of size 4C×H/4×W/4, consistent with the size of l3 before reduction: the first multilayer depth feature l13;
splicing the downscaled second-layer depth feature l2 and the reduced third-layer depth feature l3 along the channel dimension likewise yields a depth feature of size 4C×H/4×W/4, consistent with l3 before reduction: the second multilayer depth feature l23.
Step 3-4, the multilayer depth features l13 and l23 obtained in step 3-3 and the third-layer depth feature l3 of the backbone network are each fed into a copy of the network layer corresponding to the backbone's fourth-layer depth feature l4, forming the multi-branch structure; the global features comprise the first global feature l4-1, the second global feature l4-2 and the third global feature l4-3.
The global features are segmented into component features: the first global feature l4-1 is cut into first component features at granularity 1, the second global feature l4-2 into second component features at granularity 2, and the third global feature l4-3 into third component features at granularity 3;
the reduction module further reduces the channel count of the global features and component features to F, and their resolutions are pooled to 1×1; the reduction module is a 1×1 convolution layer with a shared kernel, each reduced global or component feature has size F×1×1, and the set of reduced component features is denoted S; specifically, in this embodiment F = 256.
All reduced global features and component features are spliced to obtain the constructed depth representation of the pedestrian image, of size M×F, where M is the total number of global and component features; specifically, in this embodiment M = 9 (three global features plus 1+2+3 component features). A sketch of the slicing, pooling and reduction follows below.
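The slicing, pooling and reduction of step 3-4 can be sketched as follows; slicing along the height axis and the 2048-channel branch output are assumptions consistent with a ResNet backbone, while granularities 1/2/3, F = 256 and M = 9 come from the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def branch_features(feat, granularity, reducer):
    # feat: (N, C, H, W) branch output; reducer: shared 1x1 conv mapping C -> F channels.
    maps = [feat] + list(feat.chunk(granularity, dim=2))   # global map + height strips
    pooled = [F.adaptive_max_pool2d(m, 1) for m in maps]   # pool resolutions to 1x1
    return [reducer(p).flatten(1) for p in pooled]         # each reduced to (N, F)

# Usage: three branches with granularities 1, 2, 3 and a shared reducer.
reducer = nn.Conv2d(2048, 256, 1)                           # F = 256
branches = [torch.randn(2, 2048, 24, 8) for _ in range(3)]  # stand-ins for l4-1, l4-2, l4-3
feats = [f for g, b in zip((1, 2, 3), branches)
         for f in branch_features(b, g, reducer)]
assert len(feats) == 9                                      # M = 9 features of 256 dims each
```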
In the multilayer and multi-granularity pedestrian re-identification depth model according to this embodiment, the step 4 includes:
step 4-1, defining the experiment configuration, comprising: before training the pedestrian re-identification model on the training set, first define the model optimizer used to update parameters; specifically, this embodiment uses the Adam optimizer with the AMSGrad method, loading the parameters of the pedestrian re-identification model constructed in step 3. The batch size of the input images is set to P×Q, where P is the number of pedestrian identities per batch and Q the number of pedestrian images per identity; specifically, in this embodiment P = 12 and Q = 4. Finally, a learning-rate scheduler is set. The training set is contained in an open pedestrian image data set and carries pedestrian identity labels; the number of identity-label classes of the training set is denoted Y. Specifically, this embodiment uses a multi-step learning-rate scheduler (MultiStepLR): when training reaches a preset iteration point, the learning rate is multiplied by gamma; in this embodiment gamma = 0.1, and an iteration point is preset every 40 iterations.
Step 4-2, optimizing each global feature of step 3, including: each global feature is optimized with the improved triplet loss for the feature metric, averaged over the global features; with hard-example mining within each batch, the improved triplet loss is:
L_triplet = (1/G) Σ_{g=1}^{G} Σ_{i=1}^{P} Σ_{a=1}^{Q} [ α + max_{p=1,…,Q} || f_a^{g,i} - f_p^{g,i} ||_2 - min_{j=1,…,P, j≠i; n=1,…,Q} || f_a^{g,i} - f_n^{g,j} ||_2 ]_+
where G denotes the number of global features (G = 3), f_a^{g,i} is the anchor sample of the g-th global feature of the i-th pedestrian identity, f_p^{g,i} a positive sample of the g-th global feature of the i-th pedestrian identity, and f_n^{g,j} a negative sample of the g-th global feature of the j-th pedestrian identity; α is a hyper-parameter controlling the gap between the inter-class and intra-class distances, 1.0 < α < 1.5, 1 ≤ i ≤ P, 1 ≤ a ≤ Q; in this embodiment α = 1.2.
4-3, optimize each reduced component feature obtained in step 3-4 with the identity-classification cross-entropy loss. In this embodiment, because identity classification must keep the output dimensionality consistent with the number Y of pedestrian identity labels, a linear layer without bias term is added for each component feature, mapping the F-dimensional component feature to an output of dimension Y; each component feature uses its own bias-free linear classifier, component features and classifiers corresponding one to one. The identity-classification cross-entropy loss is:
L_id = -(1/(N·P·Q)) Σ_{j=1}^{N} Σ_{q=1}^{P×Q} Σ_{r=1}^{Y} 1_{r=y} · log( softmax( fc_j(f_jq) )_r )
where fc_j denotes the j-th linear classifier, f_jq the vector of the j-th component feature f_j for the q-th pedestrian image of a batch, 1 ≤ j ≤ N, 1 ≤ q ≤ P×Q (P×Q being the batch size); N is the total number of linear classifiers, i.e. the number of component features, as described in step 3-1; and 1_{r=y} is the one-hot coded vector whose length is the number of pedestrian identities and whose hot index r equals the ground-truth identity y of the pedestrian image.
Step 4-4, adding the average cross entropy loss function of each component feature and the average improved ternary loss function of each global feature to obtain a loss function used in final training, as follows:
L=Ltriplet+Lid
and 4-5, performing model training of the network model on the training set. The specific training algorithm is as follows:
inputting: the training set D; the pedestrian identity labels y; the iteration count T; a sampler S, an optimizer OPT, a learning-rate scheduler LR; the initialization parameters θ_0, the subscript denoting the current iteration number, and the initial model Φ(x; θ_0);
outputting: the model Φ(x; θ_T);
Step 4-5-1, load the model θ_0 pre-trained on the public data set ImageNet;
Step 4-5-2, the sampler S dynamically samples from the training set D, according to the configuration of step 4-1, N_b preprocessed pedestrian images {x_i}_{i=1}^{N_b}, where x_i denotes the i-th preprocessed pedestrian image and N_b = P×Q;
4-5-3, clearing the accumulated gradient by an optimizer OPT;
step 4-5-4, extract the global features and component features:
f_i = Φ(x_i; θ_t), i = 1, …, N_b
step 4-5-5, obtain the loss value with the loss function of step 4-4:
loss = L_triplet + L_id, computed on the extracted features and the labels y;
4-5-6, performing back propagation according to the loss value loss;
step 4-5-7, the optimizer OPT updates the model parameters θ_t, while the learning-rate scheduler LR updates the learning rate;
step 4-5-8, circularly and iteratively executing the step 4-5-2 to the step 4-5-7 until the iteration number reaches T;
where the parameter subscript t in the model output by the training algorithm denotes the current iteration number, and the batch size N_b = P×Q.
In the multilayer and multi-granularity pedestrian re-identification depth model according to this embodiment, the step 5 includes:
and 5-1, loading the network model trained in the step 4, and extracting the depth characterization of the pedestrian image in a test set, wherein the test set comprises a query set and a queried set, namely extracting the query image and the depth characterization of the queried image by using the model.
As defined in steps 3-4, all global features and component features in the test set are stitched together, each feature of the test set being represented as:
Figure GDA0002976432950000163
wherein N istestRepresents the test set, θTRepresenting a parameter set when the iteration number is T;
the depth characterization of the final extracted pedestrian image is as follows:
Figure GDA0002976432950000164
and 5-2, eliminating the deviation between a training set and a test set in the enhanced pedestrian image data set, remarkably changing data distribution due to random horizontal inversion of the training set, and representing the depth of the pedestrian image by considering the inverted pedestrian image during specific test
Figure GDA0002976432950000165
And depth characterization of the flipped pedestrian image
Figure GDA0002976432950000166
Additive, pedestrian depth characterization as test set
Figure GDA0002976432950000167
Specifically, in this embodiment, the flipping function is shown as step 1-2.
Step 5-3, normalize the pedestrian depth representation F(x) obtained in step 5-2 by its two-norm, computed as:
||F||_2 = ( Σ_k F_k^2 )^{1/2}
The two-norm-normalized pedestrian depth representation of the final test set is:
F̂(x) = F(x) / ||F(x)||_2
step 5-4, according to the pedestrian depth representation of the final test set, calculating the distance between each pedestrian image in the query set and each pedestrian image in the queried set, obtaining the query result of each pedestrian image in the query set, and realizing pedestrian re-identification;
if the depth of each pedestrian image in the query set is characterized as
Figure GDA0002976432950000173
The depth of each pedestrian image in the queried set is characterized as
Figure GDA0002976432950000174
The distance matrix between the query set and the queried set is:
Figure GDA0002976432950000175
wherein N isgalleryRepresenting a queried set, NqueryRepresenting a set of queries;
the distances between each query image and each pedestrian image in all the queried sets are ranked according to the sequence from small to large, the smaller the distance between the pedestrian image in the queried set and the query image is, the higher the possibility that the pedestrian is the same is, and therefore the identification result of each query image can be obtained, and the first ten query results are generally taken for evaluation.
As shown in fig. 4, an example of query results of the multilayer multi-granularity pedestrian re-identification depth model of the embodiment, where ✓ denotes a correct retrieval and ✗ an incorrect retrieval: in each example query, the first row is the query result of the invention and the second row the query result of the classical part model PCB. The method of the invention is clearly better than the classical part model PCB and achieves the best pedestrian re-identification performance at the present stage.
According to the technical scheme, the embodiment of the invention provides a multilayer multi-granularity pedestrian re-identification depth model, comprising: step 1, preprocessing the pedestrian images in a pedestrian image data set, including adjusting the size of the pedestrian image, performing data enhancement, and normalizing and standardizing the enhanced images, the data set comprising a training set, a query set and a queried set; step 2, dynamically sampling the preprocessed training set; step 3, constructing a network model for pedestrian re-identification, i.e. constructing the depth representation of the pedestrian image, by extracting multilayer features through a backbone network model, enhancing and fusing them with sub-modules to form a multi-branch structure, and extracting the component features and global features of each branch; step 4, training the network model constructed in step 3, by defining the experiment configuration and optimizing the model parameters; and step 5, pedestrian re-identification: extracting the depth representation of each query image with the network model trained in step 4, normalizing it with the two-norm, and returning the identification result of each query image according to its cosine-distance similarity to the queried set.
In the prior art, part-based depth models often only segment the highly coupled high-level features of the backbone network, and the segmentation loses semantic information beneficial to re-identification, making the pedestrian re-identification performance of part-based depth models unstable.
By adopting the method, the problem that semantic information is lost after high-level feature segmentation is solved through the depth features based on multiple layers and multiple granularities, so that the pedestrian re-recognition performance of the depth model based on the components is improved, the pedestrian depth characterization is constructed based on data preprocessing, the model is trained, the pedestrian re-recognition is finally completed, and the best pedestrian re-recognition performance in the current stage is realized.
In particular implementations, the present invention also provides a computer storage medium, where the computer storage medium may store a program that, when executed, may include some or all of the steps in embodiments of a multi-layered, multi-granular pedestrian re-identification depth model provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The present invention provides a pedestrian image recognition method based on a deep network model, and the method and the way for implementing the technical solution are many, the above description is only the preferred embodiment of the present invention, it should be noted that, for those skilled in the art, without departing from the principle of the present invention, several improvements and embellishments can be made, and these improvements and embellishments should also be regarded as the protection scope of the present invention. All the components not specified in the present embodiment can be realized by the prior art.

Claims (3)

1. A pedestrian image identification method based on a depth network model is characterized by comprising the following steps:
step 1, preprocessing the pedestrian images in a pedestrian image data set, wherein the pedestrian image data set comprises a training set and a testing set, the testing set comprises a query set and a queried set, the pedestrian identities of the testing set do not overlap those of the training set, and the query set and the queried set share the same pedestrian identities;
step 2, dynamically sampling the preprocessed training set;
step 3, constructing a network model for pedestrian re-identification;
step 4, training the network model constructed in the step 3;
step 5, re-identifying the pedestrian;
the step 1 comprises the following steps:
step 1-1, adjusting the size of the input pedestrian image by bicubic interpolation: for any channel of pedestrian images of different sizes, the pedestrian image is resized to 3K × K; for any point P(0,0) in the image, the relative coordinates of the 16 surrounding points (the point itself included) are defined as P(r, c), with −1 ≤ r ≤ 2 and −1 ≤ c ≤ 2; r and c respectively denote the offset of the abscissa and of the ordinate, a negative value denoting a leftward or upward offset and a positive value a rightward or downward offset;
wherein P(0,0) denotes the mapping point in the original image closest to the pixel point (x1, y1) of the target interpolation image; denoting the coordinate offset of P(0,0) as (u, v) and the absolute coordinate of P(0,0) in the original image as (i, j), the bicubic interpolation is the sum of the convolution interpolations of the above 16 points, i.e. the interpolation function F(i + u, j + v):

F(i + u, j + v) = Σ_{r=−1..2} Σ_{c=−1..2} f(i + r, j + c) · S(r − u) · S(c − v)

wherein x1 = i + u, y1 = j + v, f(i + r, j + c) denotes the pixel value of any one of the 16 points in the original image, and S(x) is the sampling formula:

S(x) = (a + 2)|x|³ − (a + 3)|x|² + 1,   for |x| ≤ 1
S(x) = a|x|³ − 5a|x|² + 8a|x| − 4a,     for 1 < |x| < 2
S(x) = 0,                               otherwise

wherein a is the kernel coefficient (a = −0.5 is a common choice);
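For illustration only, the interpolation of step 1-1 can be sketched in a few lines of Python; the kernel coefficient a = −0.5 and the toy image are assumptions, not part of the claim:

```python
# A minimal sketch of the bicubic interpolation of step 1-1 (a = -0.5 assumed).
import numpy as np

def S(x, a=-0.5):
    """Sampling kernel S(x) of step 1-1."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def F(img, i, j, u, v):
    """F(i+u, j+v): convolution sum over the 16 points f(i+r, j+c), r, c in [-1, 2]."""
    return sum(img[i + r, j + c] * S(r - u) * S(c - v)
               for r in range(-1, 3) for c in range(-1, 3))

img = np.arange(49, dtype=float).reshape(7, 7)   # toy single-channel image
print(F(img, i=3, j=3, u=0.25, v=0.5))           # interpolated value at (3.25, 3.5)
```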
step 1-2, randomly flipping the pedestrian image horizontally: for any channel of a pedestrian image of size 3K × K, the image is flipped horizontally at random with probability P1, 0 < P1 < 1; for a second arbitrary point (x2, y2) on the pedestrian image, the coordinates (xf, yf) of its symmetric point after the horizontal flip are:

(xf, yf) = (x2, K − y2 − 1)

wherein (x2, y2) is the coordinate of the second arbitrary point in the pedestrian image, 0 ≤ x2 ≤ 3K, 0 ≤ y2 ≤ K;
step 1-3, randomly erasing the pedestrian image: for any channel of a pedestrian image of size 3K × K, a random region of size h × w is erased with probability P2, 0 < P2 < 1, according to the following erasing function f(·), all pixel values of each channel within the region being set to the pixel-value mean of that channel:

f(x3 : x3 + h, y3 : y3 + w) = m

wherein (x3, y3) is the coordinate of a third arbitrary point in the pedestrian image, 0 ≤ x3 ≤ 3K, 0 ≤ y3 ≤ K, and m is the pixel-value mean of each channel of the pedestrian image;
step 1-4, performing data standardization on the data of each channel of the pedestrian image: any channel of a pedestrian image of size 3K × K is normalized and standardized according to the following function f(x):

f(x) = (x / 255 − μ) / δ

wherein x is the pixel value of any point under each channel of the pedestrian image obtained in step 1-3, 0 ≤ x ≤ 255, μ is the per-channel mean of the public data set ImageNet, and δ is the per-channel standard deviation of the public data set ImageNet;
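Steps 1-1 to 1-4 map directly onto a standard augmentation pipeline; a minimal torchvision sketch follows, where K = 128 (so 3K × K = 384 × 128) and P1 = P2 = 0.5 are assumed values not fixed by the claim:

```python
# A sketch of the preprocessing of steps 1-1 to 1-4 with torchvision.
import torchvision.transforms as T

IMAGENET_MEAN = [0.485, 0.456, 0.406]   # per-channel mean (mu) of ImageNet
IMAGENET_STD  = [0.229, 0.224, 0.225]   # per-channel std (delta) of ImageNet

train_transform = T.Compose([
    T.Resize((384, 128), interpolation=T.InterpolationMode.BICUBIC),  # step 1-1
    T.RandomHorizontalFlip(p=0.5),                                    # step 1-2
    T.ToTensor(),                              # scales pixel values to [0, 1]
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),                         # step 1-4
    # step 1-3: after Normalize, the value 0 corresponds to the channel mean,
    # approximating the claim's "erase with the per-channel pixel mean".
    T.RandomErasing(p=0.5, value=0),
])
```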
the step 2 comprises the following steps:
step 2-1, counting the index list corresponding to the pedestrian images of each identity in the training set, the pedestrian images in the training set being the training samples; defining US as the dictionary of index lists of unsampled samples, TS as the set of samples the model classifies correctly, and FS as the set of samples the model classifies wrongly; TS and FS are initialized empty, and US is initialized as the dictionary formed by all current training samples;
step 2-2, performing dynamic sampling: under the current iteration round, a batch consisting of P pedestrian identities with Q images each is acquired from the training set, the P pedestrian identities being randomly sampled from the label list of the training set;
step 2-3, for each pedestrian identity acquired in step 2-2, preferentially sampling the Q images from the US set; if the US set is empty or holds fewer than Q remaining images of the corresponding identity, complementing by sampling from the FS set; if the number is still insufficient, complementing from the TS set; and if it is still insufficient, looping step 2-3 and sampling repeatedly until Q images are acquired;
step 2-4, after each iteration sampling, transferring the samples sampled in the current iteration round from the US set to the FS set, simultaneously transferring the samples correctly classified by the model from the FS set to the TS set, and transferring the samples wrongly classified by the model from the TS set to the FS set;
step 2-5, the step 2-3 and the step 2-4 are circulated until a batch with the size of P multiplied by Q is obtained by sampling;
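A simplified Python sketch of the US/FS/TS bookkeeping of steps 2-1 to 2-5 is given below; the class and method names are hypothetical, and each identity is assumed to own at least one image:

```python
# A simplified sketch of the dynamic sampling of steps 2-1 to 2-5.
import random
from collections import defaultdict

class DynamicSampler:
    def __init__(self, labels, P, Q):
        self.P, self.Q, self.ids = P, Q, sorted(set(labels))
        self.US = defaultdict(list)          # step 2-1: unsampled indices
        for idx, pid in enumerate(labels):
            self.US[pid].append(idx)
        self.FS = defaultdict(list)          # wrongly classified samples
        self.TS = defaultdict(list)          # correctly classified samples

    def sample_batch(self):
        batch = []
        for pid in random.sample(self.ids, self.P):                  # step 2-2
            picked = []
            for pool in (self.US[pid], self.FS[pid], self.TS[pid]):  # step 2-3
                while pool and len(picked) < self.Q:
                    picked.append(pool.pop(random.randrange(len(pool))))
            while len(picked) < self.Q:      # step 2-3: resample if still short
                picked.append(random.choice(picked))
            self.FS[pid].extend(picked)      # step 2-4: sampled indices -> FS
            batch.extend(picked)
        return batch                         # step 2-5: one P x Q batch

    def refile(self, pid, idx, correct):
        """Step 2-4: move a sample between FS and TS by its prediction result."""
        src, dst = (self.FS, self.TS) if correct else (self.TS, self.FS)
        if idx in src[pid]:
            src[pid].remove(idx)
            dst[pid].append(idx)
```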
the step 3 comprises the following steps:
step 3-1, constructing a network model for pedestrian re-identification, wherein the network model comprises a backbone network model and sub-modules;
extracting multilayer features through a backbone network model, namely extracting features of different depths, the features of different depths comprising: a first-layer depth feature l1, a second-layer depth feature l2, a third-layer depth feature l3 and a fourth-layer depth feature l4; the backbone network model adopts ResNet, a classical classification network for the ImageNet data set;
the sub-modules comprise an enhancement module, a downscaling module, a reduction module and a maximum pooling layer module; the first-layer depth feature l1 and the second-layer depth feature l2 are defined as low-level features, while the third-layer depth feature l3 and the fourth-layer depth feature l4 are high-level features;
when the first-layer depth feature l1 has size C × H × W, it follows from the backbone network model that the second-layer depth feature l2 has size 2C × H/2 × W/2 and the third-layer depth feature l3 has size 4C × H/4 × W/4, wherein C is the number of channels, H the height and W the width of the first-layer depth feature l1;
step 3-2, enhancing the first-layer depth feature l1 and the second-layer depth feature l2 with two enhancement modules respectively, their sizes remaining unchanged; then reducing the sizes of the enhanced l1 and l2 to 2C × H/4 × W/4 through the two downscaling modules respectively;
step 3-3, reducing the number of channels of the third-layer depth feature l3 to half of the original through the reduction module, i.e. reducing its size to 2C × H/4 × W/4;
splicing the downscaled first-layer depth feature l1 and the reduced third-layer depth feature l3 along the channel dimension to obtain a first multilayer depth feature l13 of size 4C × H/4 × W/4;
splicing the downscaled second-layer depth feature l2 and the reduced third-layer depth feature l3 along the channel dimension to obtain a second multilayer depth feature l23 of size 4C × H/4 × W/4 (the two spliced 2C-channel features yield 4C channels, matching the input of the network layer corresponding to l4);
step 3-4, feeding the multilayer depth features l13 and l23 obtained in step 3-3, together with the third-layer depth feature l3 of the backbone network model, into separate copies of the network layer corresponding to the fourth-layer depth feature l4 of the backbone network model, forming a multi-branch structure; the global features comprise a first global feature l4-1, a second global feature l4-2 and a third global feature l4-3; the first global feature l4-1 is obtained by feeding the third-layer depth feature l3 into the network layer corresponding to l4 and is thus equivalent to the fourth-layer depth feature of the backbone network model; the second global feature l4-2 is obtained by feeding l23 into the network layer corresponding to l4; and the third global feature l4-3 is obtained by feeding l13 into the network layer corresponding to l4;
segmenting the global features into component features, comprising: cutting the first global feature l4-1 into first component features with granularity 1, the second global feature l4-2 into second component features with granularity 2, and the third global feature l4-3 into third component features with granularity 3;
pooling the resolutions of the global features and the component features to 1 × 1 with the maximum pooling layer module, and further reducing the numbers of channels of the global features and the component features to F with the reduction module, the reduction module here being a 1 × 1 convolution layer with a shared kernel; each reduced global feature and component feature has size F × 1 × 1, and the set formed by the reduced component features is denoted S;
splicing all the reduced global features and component features to obtain the constructed depth characterization of the pedestrian image, of size M × F, wherein M is the total number of global features and component features;
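Under the assumption that the backbone is ResNet-50 (so C = 256) and that the enhancement module is a plain conv-BN-ReLU block (the claim does not fix its internals), the branch structure of steps 3-1 to 3-4 can be sketched in PyTorch as follows:

```python
# A PyTorch sketch of the multi-branch model of step 3 (assumptions noted above).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

def enhance(c):
    # Assumed enhancement module: keeps the feature size unchanged.
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU())

class MultiBranchReID(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        r = resnet50(weights="IMAGENET1K_V1")        # pre-trained backbone, C = 256
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2, self.layer3 = r.layer1, r.layer2, r.layer3
        self.b1, self.b2, self.b3 = (copy.deepcopy(r.layer4) for _ in range(3))
        self.enh1, self.enh2 = enhance(256), enhance(512)
        self.down1 = nn.Conv2d(256, 512, 3, stride=4, padding=1)  # l1 -> 2C x H/4 x W/4
        self.down2 = nn.Conv2d(512, 512, 3, stride=2, padding=1)  # l2 -> 2C x H/4 x W/4
        self.red3 = nn.Conv2d(1024, 512, 1)                       # halve l3 channels
        self.reduce = nn.Conv2d(2048, feat_dim, 1)                # shared 1x1 reduction to F

    def forward(self, x):
        l1 = self.layer1(self.stem(x))
        l2 = self.layer2(l1)
        l3 = self.layer3(l2)
        l3r = self.red3(l3)
        l13 = torch.cat([self.down1(self.enh1(l1)), l3r], 1)      # 4C x H/4 x W/4
        l23 = torch.cat([self.down2(self.enh2(l2)), l3r], 1)      # 4C x H/4 x W/4
        feats = []
        # branches: l3 -> l4-1 (1 part), l23 -> l4-2 (2 parts), l13 -> l4-3 (3 parts)
        for branch, inp, parts in ((self.b1, l3, 1), (self.b2, l23, 2), (self.b3, l13, 3)):
            g = branch(inp)
            feats.append(self.reduce(F.adaptive_max_pool2d(g, 1)).flatten(1))  # global
            stripes = F.adaptive_max_pool2d(g, (parts, 1))        # horizontal stripes
            for k in range(parts):
                feats.append(self.reduce(stripes[:, :, k:k+1, :]).flatten(1))  # parts
        return feats   # M = 9 reduced features, each of dimension feat_dim (F)
```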
step 4 comprises the following steps:
step 4-1, defining the experiment-related configuration: before training the network model on the training set, first defining the model optimizer used for updating parameters; setting the batch size of the dynamic sampling of step 2 to P × Q, wherein P denotes the number of pedestrian identities in each batch and Q the number of pedestrian images per identity; and finally setting a learning-rate scheduler; the training set carries pedestrian identity labels, and the number of identity classes of the training set is denoted Y;
step 4-2, optimizing each global feature of step 3 respectively: every global feature is optimized with an improved triplet loss function L_triplet for the feature metric:

L_triplet = Σ_{g=1..G} Σ_{i=1..P} Σ_{a=1..Q} [ α + max_{p=1..Q} ||f_a^(g,i) − f_p^(g,i)||_2 − min_{j=1..P, j≠i; n=1..Q} ||f_a^(g,i) − f_n^(g,j)||_2 ]_+

wherein G denotes the number of global features; f_a^(g,i) denotes an anchor sample of the g-th global feature of the i-th pedestrian identity, f_p^(g,i) a positive sample of the g-th global feature of the i-th pedestrian identity, and f_n^(g,j) a negative sample of the g-th global feature of the j-th pedestrian identity; α is a hyper-parameter controlling the gap between the inter-class and the intra-class distance, 1.0 < α < 1.5, 1 ≤ i ≤ P, 1 ≤ a ≤ Q;
step 4-3, optimizing each reduced component feature obtained in step 3-4 with a cross-entropy loss function for identity classification, each component feature using a bias-free linear classifier, the component features corresponding to the linear classifiers one to one; the identity-classification cross-entropy loss function L_id is:

L_id = − Σ_{j=1..N} Σ_{q=1..P×Q} Σ_{r=1..Y} 1_{r=y} · log softmax(fc_j(f_jq))_r

wherein fc_j denotes the j-th linear classifier; f_jq denotes the vector of the j-th component feature f_j for the q-th pedestrian image in a batch, 1 ≤ j ≤ N, 1 ≤ q ≤ P×Q; N denotes the total number of linear classifiers, i.e. the number of component features; 1_{r=y} denotes a one-hot coded vector whose length equals the number of pedestrian identities and whose hot index r equals the identity ground truth y of the pedestrian image;
step 4-4, adding the cross-entropy loss function and the improved triplet loss function to obtain the loss function L used in the final training:

L = L_triplet + L_id
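A compact PyTorch sketch of this combined objective — a batch-hard triplet term over the global features plus bias-free linear classifiers over the part features — follows; the margin value 1.2 is an assumption within the claimed range (1.0, 1.5):

```python
# A sketch of the loss of steps 4-2 to 4-4 (alpha = 1.2 assumed).
import torch
import torch.nn.functional as F

def triplet_hard(feats, pids, alpha=1.2):
    """Improved triplet loss: hardest positive minus hardest negative."""
    dist = torch.cdist(feats, feats)                 # pairwise two-norm distances
    same = pids.unsqueeze(0) == pids.unsqueeze(1)    # same-identity mask
    d_pos = dist.masked_fill(~same, float('-inf')).max(1).values  # max intra-class
    d_neg = dist.masked_fill(same, float('inf')).min(1).values    # min inter-class
    return F.relu(alpha + d_pos - d_neg).mean()

def total_loss(global_feats, part_feats, classifiers, pids, alpha=1.2):
    """L = L_triplet + L_id (steps 4-2 to 4-4)."""
    l_tri = sum(triplet_hard(g, pids, alpha) for g in global_feats)
    l_id = sum(F.cross_entropy(fc(f), pids)          # fc: nn.Linear(F_dim, Y, bias=False)
               for fc, f in zip(classifiers, part_feats))
    return l_tri + l_id
```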
step 4-5, performing model training of the network model on the training set;
in step 4-5, when model training of the network model is performed on the training set, the inputs are: the training set D; the pedestrian identity labels y; the iteration number T; the sampler S; the optimizer OPT; the learning-rate scheduler LR; the initialization parameters θ_0, the subscript 0 denoting the current iteration number; and the initial model Φ(x; θ_0). The output is the model Φ(x; θ_T). The specific training process comprises the following steps:
step 4-5-1, loading the parameters θ_0 pre-trained on the public data set ImageNet;
step 4-5-2, the sampler S dynamically samples N_b preprocessed pedestrian images {x_i | i = 1, …, N_b} from the training set D according to the configuration of step 4-1, wherein x_i denotes the i-th preprocessed pedestrian image and N_b = P × Q;
step 4-5-3, clearing the accumulated gradients through the optimizer OPT;
step 4-5-4, extracting the global features and the component features: f_i = Φ(x_i; θ_t), wherein θ_t denotes the model parameters at the current iteration t and f_i denotes the spliced global and component features of x_i;
step 4-5-5, obtaining the loss value with the loss function of step 4-4: loss = L(Φ(x_i; θ_t), y);
step 4-5-6, performing back propagation according to the loss value loss;
step 4-5-7, the optimizer OPT updates the model parameters θ_t, while the learning-rate scheduler LR updates the learning rate;
step 4-5-8, iteratively executing steps 4-5-2 to 4-5-7 in a loop until the iteration number reaches T.
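Condensed into code, the loop of steps 4-5-1 to 4-5-8 reads as below; the Adam optimizer, the step scheduler, and the model interface (returning global and part features and holding the classifiers) are assumptions, with `total_loss` and the sampler following the earlier sketches:

```python
# A condensed sketch of the training loop of steps 4-5-1 to 4-5-8.
import torch

def train(model, dataset, sampler, T, device="cuda"):
    opt = torch.optim.Adam(model.parameters(), lr=3.5e-4)        # optimizer OPT
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=40)   # scheduler LR
    model.to(device).train()                 # step 4-5-1: ImageNet-pretrained weights
    for t in range(T):                       # step 4-5-8: iterate T times
        idxs = sampler.sample_batch()        # step 4-5-2: N_b = P x Q images
        imgs = torch.stack([dataset[i][0] for i in idxs]).to(device)
        pids = torch.tensor([dataset[i][1] for i in idxs], device=device)
        opt.zero_grad()                      # step 4-5-3: clear accumulated gradients
        g_feats, p_feats = model(imgs)       # step 4-5-4: global + component features
        loss = total_loss(g_feats, p_feats, model.classifiers, pids)  # step 4-5-5
        loss.backward()                      # step 4-5-6: back propagation
        opt.step(); sched.step()             # step 4-5-7: update theta_t and lr
    return model
```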
2. The method of claim 1, wherein step 5 comprises:
step 5-1, loading the network model trained in step 4 and extracting, with the network model, the depth characterizations of the pedestrian images in the test set, namely the depth characterizations of the query images in the query set and of the queried images in the queried set;
all the global features and component features are spliced as defined in step 3-4; each feature of the test set is expressed as

f_x = Φ(x; θ_T), x ∈ N_test

wherein N_test denotes the test set and θ_T denotes the parameter set when the iteration number is T; the finally extracted depth characterization of a pedestrian image x is f_x;
step 5-2, eliminating the deviation between the training set and the test set of the pedestrian image data set: the depth characterization f_x of the pedestrian image and the depth characterization f_x' of its horizontally flipped copy are added, giving the test-set depth characterization of the pedestrian image:

f_x^test = f_x + f_x';
step 5-3, normalizing the depth characterization f_x^test obtained in step 5-2 with the two-norm, the two-norm being calculated according to the following formula:

||f||_2 = ( Σ_{k=1..M×F} f_k² )^(1/2)

the final test-set depth characterization of the pedestrian image, normalized with the two-norm, being:

f_x^norm = f_x^test / ||f_x^test||_2;
step 5-4, calculating, from the final test-set depth characterizations, the distance between each pedestrian image in the query set and each pedestrian image in the queried set, obtaining the query result of each pedestrian image in the query set and thereby realizing pedestrian re-identification.
3. The method of claim 2, wherein step 5-4 comprises: denoting the depth characterization of the i-th pedestrian image in the query set as f_i^norm, i ∈ N_query, and the depth characterization of the j-th pedestrian image in the queried set as f_j^norm, j ∈ N_gallery, the distance matrix between the query set and the queried set is:

M_ji = 1 − f_i^norm · f_j^norm

wherein N_gallery denotes the queried set, N_query denotes the query set, and M_ji denotes the element in the i-th row and the j-th column of the matrix; the distances between each query image and the pedestrian images in the queried set are sorted in ascending order to obtain the identification result of each query image.
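The retrieval of claims 2 and 3 amounts to flip-augmented feature extraction, two-norm normalization, and a cosine-distance ranking; a short sketch follows, with a hypothetical `extract` function standing in for Φ(·; θ_T):

```python
# A sketch of the inference of steps 5-1 to 5-4 and claim 3.
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed(extract, imgs):
    """Steps 5-1 to 5-3: add the flipped-image features, then two-norm normalize."""
    feats = extract(imgs) + extract(torch.flip(imgs, dims=[3]))   # step 5-2
    return F.normalize(feats, p=2, dim=1)                         # step 5-3

def rank(query_f, gallery_f):
    """Step 5-4: cosine-distance matrix M and ascending ranking per query image."""
    M = 1.0 - query_f @ gallery_f.t()        # unit norm => cosine distance
    return M.argsort(dim=1)                  # nearest gallery images first
```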
CN201911362901.4A 2019-12-26 2019-12-26 Pedestrian image identification method based on depth network model Active CN111177447B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911362901.4A CN111177447B (en) 2019-12-26 2019-12-26 Pedestrian image identification method based on depth network model

Publications (2)

Publication Number Publication Date
CN111177447A CN111177447A (en) 2020-05-19
CN111177447B true CN111177447B (en) 2021-04-30

Family

ID=70655664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911362901.4A Active CN111177447B (en) 2019-12-26 2019-12-26 Pedestrian image identification method based on depth network model

Country Status (1)

Country Link
CN (1) CN111177447B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783526B (en) * 2020-05-21 2022-08-05 昆明理工大学 Cross-domain pedestrian re-identification method using posture invariance and graph structure alignment
CN111882548A (en) * 2020-07-31 2020-11-03 北京小白世纪网络科技有限公司 Method and device for counting cells in pathological image based on deep learning
CN112926569B (en) * 2021-03-16 2022-10-18 重庆邮电大学 Method for detecting natural scene image text in social network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101865A (en) * 2018-05-31 2018-12-28 湖北工业大学 A kind of recognition methods again of the pedestrian based on deep learning
CN110008842A (en) * 2019-03-09 2019-07-12 同济大学 A kind of pedestrian's recognition methods again for more losing Fusion Model based on depth
CN110046553A (en) * 2019-03-21 2019-07-23 华中科技大学 A kind of pedestrian weight identification model, method and system merging attributive character
CN110059616A (en) * 2019-04-17 2019-07-26 南京邮电大学 Pedestrian's weight identification model optimization method based on fusion loss function
CN110096947A (en) * 2019-03-15 2019-08-06 昆明理工大学 A kind of pedestrian based on deep learning recognizer again
CN110188611A (en) * 2019-04-26 2019-08-30 华中科技大学 A kind of pedestrian recognition methods and system again introducing visual attention mechanism
CN110472591A (en) * 2019-08-19 2019-11-19 浙江工业大学 It is a kind of that pedestrian's recognition methods again is blocked based on depth characteristic reconstruct


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jinlin Wu, Zhen Lei et al., "Clustering and Dynamic Sampling Based Unsupervised Domain Adaptation for Person Re-identification," 2019 IEEE International Conference on Multimedia and Expo, 2019-07-12, pp. 886-888 *
Guanshuo Wang et al., "Learning Discriminative Features with Multiple Granularities for Person Re-Identification," Proceedings of the 26th ACM International Conference on Multimedia, 2018-12-26, pp. 274-282 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant