CN111177447B - Pedestrian image identification method based on depth network model - Google Patents


Info

Publication number: CN111177447B (granted publication of application CN201911362901.4A)
Authority: CN (China)
Prior art keywords: pedestrian, feature, features, pedestrian image, image
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN111177447A
Inventors: 杨育彬, 林喜鹏
Current and original assignee: Nanjing University
Events: application filed by Nanjing University; publication of CN111177447A; application granted; publication of CN111177447B; anticipated expiration tracked

Classifications

    • G06F16/583: Information retrieval of still image data; retrieval characterised by metadata automatically derived from the content
    • G06F16/55: Information retrieval of still image data; clustering; classification
    • G06F18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/047: Neural networks; probabilistic or stochastic networks
    • G06N3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
    • G06V40/23: Recognition of whole body movements, e.g. for sport training

Abstract

The invention provides a pedestrian image identification method based on a depth network model, which comprises the following steps: preprocessing the pedestrian images; running an adaptive sampling algorithm on the preprocessed data to obtain batches containing more hard samples; extracting multilayer features through a backbone network model, where sub-modules enhance the low-level features, which are then downscaled and spliced with the high-level features to obtain the multilayer features; segmenting the multilayer features at different granularities to form a multi-branch structure, extracting the component features and global features of each branch, and splicing all extracted features into the depth representation of the pedestrian image; training the constructed network model; and extracting the depth representation of each query image with the trained network model and returning the identification result of each query image according to its cosine-distance similarity to the queried set. This multilayer, multi-granularity pedestrian re-identification depth model achieves the best pedestrian re-identification performance at the present stage.

Description

Pedestrian image identification method based on depth network model
Technical Field
The invention relates to the field of machine learning and computer vision, in particular to a pedestrian image identification method based on a depth network model.
Background
With the development of modern society, public safety has gradually received people's attention. Large numbers of surveillance cameras are installed in crowded places prone to public safety incidents, such as shopping malls, apartments, schools, hospitals, office buildings and large squares, and research on surveillance video concentrates on identifying visible objects, especially pedestrians, since pedestrians are usually the targets of a monitoring system. More specifically, the task of the surveillance system is to search for a specific pedestrian in the surveillance video data, i.e. the pedestrian re-identification task.
However, the volume of surveillance video data is often enormous, and finding a specific pedestrian in such massive data is very challenging owing to factors such as the lighting of the environment, occlusions, the pedestrian's clothing, the shooting angle and the camera itself. Monitoring by manual identification is not only costly but also inefficient and unstable, so relying on manual identification alone for pedestrian re-identification is unrealistic in the long run. Quickly analyzing the surveillance video data of public safety places and automatically finding specific pedestrians can therefore markedly improve monitoring quality, and is of great significance for city construction and the guarantee of social safety.
Among existing pedestrian re-identification methods, part-based depth models achieve the most advanced performance. However, current part-based depth models usually obtain component features by segmenting the high-level features of the backbone network. On one hand, the high-level features of a depth model are highly coupled, and simply segmenting them causes a loss of semantic information, which limits model performance. On the other hand, although the semantic information of the low-level features of a depth model is weak, the low-level features are weakly coupled and more robust to segmentation, so combining the high-level and low-level features can alleviate the semantic information loss caused by segmentation.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to overcome the technical shortcomings of the prior art by providing a pedestrian image recognition method based on a depth network model, addressing the semantic information loss of existing part-based depth models for pedestrian re-identification. The invention comprises the following steps:
step 1, preprocessing pedestrian images in a pedestrian image data set, wherein the pedestrian image data set comprises a training set and a testing set, the testing set comprises a query set and a queried set, the pedestrian identities in the testing set and the training set are not repeated, and the query set and the queried set have the same pedestrian identity;
step 2, dynamically sampling the preprocessed training set;
step 3, constructing a network model for pedestrian re-identification;
step 4, training the network model constructed in the step 3;
and 5, re-identifying the pedestrian.
The step 1 comprises the following steps:
step 1-1, adjusting the size of the input pedestrian image by bicubic interpolation: for any channel of pedestrian images of different sizes, resize the image to 3K×K, where K is generally 128 or 192. For any point P(0,0) in the image, define the relative coordinates of the 16 surrounding points, including the point itself, as P(r,c), with -1 ≤ r ≤ 2 and -1 ≤ c ≤ 2; r and c denote the offsets of the abscissa and ordinate respectively, a negative value meaning a leftward or upward offset and a positive value a rightward or downward offset, e.g. P(0,1) is the point adjacent to the right of P(0,0);
where P(0,0) denotes the point in the original image closest to the mapping of pixel (x1, y1) of the target interpolation image; let (u, v) be the coordinate offset of (x1, y1) from P(0,0), and (i, j) the absolute coordinates of P(0,0) in the original image. Bicubic interpolation is then the sum of the convolution interpolations of the 16 points, i.e. the interpolation function F(i+u, j+v):
F(i+u, j+v) = Σ_{r=-1}^{2} Σ_{c=-1}^{2} f(i+r, j+c) · s(r-u) · s(c-v)
where x1 = i+u, y1 = j+v, f(i+r, j+c) is the pixel value of the corresponding one of the 16 points in the original image, and s(x) is the sampling kernel:
s(x) = (a+2)|x|^3 - (a+3)|x|^2 + 1, for |x| ≤ 1;
s(x) = a|x|^3 - 5a|x|^2 + 8a|x| - 4a, for 1 < |x| < 2;
s(x) = 0, otherwise;
where a is the formula coefficient, commonly taken as -0.5;
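As an illustration, the kernel and the 16-point summation above translate directly into Python; the following is a minimal NumPy sketch under our own naming (the patent gives only the formulas, not code), with a = -0.5 as the common coefficient:

```python
import numpy as np

def s(x, a=-0.5):
    # Bicubic sampling kernel; a is the formula coefficient (commonly -0.5).
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def bicubic_at(img, i, j, u, v):
    # Interpolate one channel at (i + u, j + v): weighted sum of the
    # 16 neighbours P(r, c), -1 <= r, c <= 2, as in F(i+u, j+v) above.
    h, w = img.shape
    val = 0.0
    for r in range(-1, 3):
        for c in range(-1, 3):
            # Clamp to the image border so the sketch stays runnable at edges.
            pr = min(max(i + r, 0), h - 1)
            pc = min(max(j + c, 0), w - 1)
            val += img[pr, pc] * s(r - u) * s(c - v)
    return val
```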
step 1-2, randomly horizontally flipping the pedestrian image: any channel of a 3K×K pedestrian image is flipped horizontally with probability P1, 0 < P1 < 1. For a second arbitrary point (x2, y2) on the pedestrian image, the coordinates (xf, yf) of its mirror point after the horizontal flip are:
(xf, yf) = (x2, K - y2 - 1)
where (x2, y2) are the coordinates of the second arbitrary point in the pedestrian image, 0 ≤ x2 ≤ 3K, 0 ≤ y2 ≤ K;
Step 1-3, randomly erasing the pedestrian image: for any channel of a 3K×K pedestrian image, with probability P2, 0 < P2 < 1, erase a random region of size h×w according to the random erasing function f() below, setting all pixel values of each channel inside the region to that channel's mean pixel value:
f(x3:x3+h, y3:y3+w) = m,
where (x3, y3) are the coordinates of a third arbitrary point in the pedestrian image, 0 ≤ x3 ≤ 3K, 0 ≤ y3 ≤ K, and m is the mean pixel value of each channel of the pedestrian image;
step 1-4, standardizing the data of each channel of the pedestrian image: normalize and standardize any channel of the 3K×K pedestrian image according to the following normalization function f(x):
f(x) = (x/255 - μ) / δ
where x is the pixel value of any point of each channel of the pedestrian image obtained in step 1-3, 0 ≤ x ≤ 255, μ is the channel mean on the public data set ImageNet, and δ is the channel standard deviation on ImageNet.
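Taken together, steps 1-1 to 1-4 amount to a standard augmentation pipeline. A hedged torchvision-style sketch follows; K, the 0.5 flip/erase probabilities and the ImageNet statistics come from the text and the embodiment below, while the ordering (erasing after normalization, with a constant fill approximating the channel mean) is an adaptation for torchvision tensors, not the patent's code:

```python
import torchvision.transforms as T

K = 128  # image width; height is 3K as in step 1-1
preprocess = T.Compose([
    T.Resize((3 * K, K), interpolation=T.InterpolationMode.BICUBIC),  # step 1-1
    T.RandomHorizontalFlip(p=0.5),                                    # step 1-2, P1 = 0.5
    T.ToTensor(),                                                     # scales pixels to [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],                           # step 1-4, ImageNet stats
                std=[0.229, 0.224, 0.225]),
    # step 1-3, P2 = 0.5; the patent erases with the per-channel mean,
    # which is approximately 0 after normalization
    T.RandomErasing(p=0.5, value=0),
])
```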
The step 2 comprises the following steps:
step 2-1, counting the index list of the pedestrian images of each identity in the training set, where a pedestrian image in the training set is a training sample; define the dictionary of not-yet-sampled sample index lists as US, the set of samples the model classifies correctly as TS, and the set of samples the model misclassifies as FS; initialize TS and FS as empty and US as the dictionary formed by all current training samples;
step 2-2, performing dynamic sampling: in the current iteration round, draw from the training set a batch consisting of P pedestrians with Q images each, the P pedestrian identities being randomly sampled from the label list of the training set;
step 2-3, for each pedestrian identity obtained in step 2-2, preferentially sample Q images from the US set; if the US set is empty or fewer than Q images of that identity remain, complete the sample from the FS set, then from the TS set, and if the number is still insufficient, repeat step 2-3 with resampling until Q images are obtained;
step 2-4, after each iteration's sampling, move the samples drawn in the current round from the US set to the FS set; at the same time, move the samples the model now classifies correctly from the FS set to the TS set, and the samples the model misclassifies from the TS set to the FS set;
step 2-5, loop steps 2-3 and 2-4 until a batch of size P×Q has been sampled; a minimal sketch of this bookkeeping follows below.
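A minimal Python sketch of the US/FS/TS bookkeeping, assuming per-identity index lists and an external signal telling the sampler which samples the model classified correctly (all names are illustrative, not the patent's code):

```python
import random
from collections import defaultdict

class DynamicSampler:
    def __init__(self, labels):
        # labels[i] = identity of training image i (step 2-1).
        self.by_id = defaultdict(list)
        for idx, pid in enumerate(labels):
            self.by_id[pid].append(idx)
        self.US = {pid: set(v) for pid, v in self.by_id.items()}  # not yet sampled
        self.FS = defaultdict(set)  # misclassified (hard) samples
        self.TS = defaultdict(set)  # correctly classified samples

    def sample_batch(self, P, Q):
        pids = random.sample(list(self.by_id), P)      # step 2-2
        batch = []
        for pid in pids:                               # step 2-3: US first, then FS, then TS
            picked = []
            for pool in (self.US[pid], self.FS[pid], self.TS[pid]):
                while pool and len(picked) < Q:
                    picked.append(pool.pop())
            while len(picked) < Q:                     # still short: resample with repetition
                picked.append(random.choice(self.by_id[pid]))
            self.FS[pid].update(picked)                # step 2-4: drawn samples leave US,
            batch.extend(picked)                       # parked in FS until feedback arrives
        return batch

    def feedback(self, pid, idx, correct):
        # Step 2-4: move samples between FS and TS according to the model's prediction.
        (self.TS if correct else self.FS)[pid].add(idx)
        (self.FS if correct else self.TS)[pid].discard(idx)
```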
the step 3 comprises the following steps:
step 3-1, constructing a network model for pedestrian re-identification, the network model comprising a backbone network model and sub-modules;
the backbone extracts multilayer features, i.e. features of different depths: the first-layer depth feature l1, the second-layer depth feature l2, the third-layer depth feature l3 and the fourth-layer depth feature l4. The backbone network model is ResNet, a classical classification network for the ImageNet data set;
the sub-modules comprise an enhancement module, a downscaling module, a reduction module and a max-pooling module. The first-layer depth feature l1 and second-layer depth feature l2 are defined as low-level features, while the third-layer depth feature l3 and fourth-layer depth feature l4 are high-level features;
when the first-layer depth feature l1 has size C×H×W, then by the backbone network model the second-layer depth feature l2 has size 2C×H/2×W/2 and the third-layer depth feature l3 has size 4C×H/4×W/4, where C, H and W are the channel count, height and width of l1;
step 3-2, two enhancement modules respectively enhance l1 and l2 while keeping their sizes unchanged; two downscaling modules then reduce l1 and l2 each to size 2C×H/4×W/4;
step 3-3, the reduction module halves the channel count of the third-layer depth feature l3, i.e. reduces it to size 2C×H/4×W/4;
the downscaled first-layer depth feature l1 and the reduced third-layer depth feature l3 are spliced along the channel dimension to obtain the first multilayer depth feature l13 of size 4C×H/4×W/4, matching the size of l3 before reduction;
the downscaled second-layer depth feature l2 and the reduced third-layer depth feature l3 are spliced along the channel dimension to obtain the second multilayer depth feature l23, likewise of size 4C×H/4×W/4;
step 3-4, the multilayer depth features l13 and l23 obtained in step 3-3 and the third-layer depth feature l3 of the backbone network model are each fed into a copy of the network layer corresponding to the backbone's fourth-layer depth feature l4, forming the multi-branch structure. The global features comprise: the first global feature l4-1, the second global feature l4-2 and the third global feature l4-3. The first global feature l4-1 is obtained by feeding l3 into the layer corresponding to l4 and is equivalent to the backbone's fourth-layer depth feature; the second global feature l4-2 is obtained by feeding l23 into that layer; the third global feature l4-3 is obtained by feeding l13 into that layer;
the global features are segmented into component features: the first global feature l4-1 is cut into first component features at granularity 1, the second global feature l4-2 into second component features at granularity 2, and the third global feature l4-3 into third component features at granularity 3;
a max-pooling module pools the resolution of the global features and component features to 1×1, and the reduction module, a 1×1 convolution layer with a shared kernel, further reduces the channel count of the global and component features to F (F is generally 256, as in the embodiment below). Each reduced global or component feature thus has size F×1×1, and the set of reduced component features is denoted S;
all reduced global features and component features are spliced to obtain the constructed depth representation of the pedestrian image, of size M×F, where M is the total number of global and component features. A sketch of the branch construction follows below.
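To make the fusion concrete, here is a hedged PyTorch sketch of steps 3-1 to 3-4 using the torchvision ResNet-50 layer names; the internals of the enhancement, downscaling and reduction modules are simplified to single convolutions (fig. 3 gives the actual Conv-BatchNorm-ReLU blocks), so this is a sketch under stated assumptions, not the patent's implementation:

```python
import copy
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultiLayerBranches(nn.Module):
    def __init__(self, C=256):
        super().__init__()
        r = resnet50()
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2, self.layer3 = r.layer1, r.layer2, r.layer3
        # enhancement keeps sizes; downscaling maps l1/l2 to 2C x H/4 x W/4;
        # reduction halves l3's channels (steps 3-2 / 3-3) -- simplified convs
        self.enh1 = nn.Conv2d(C, C, 3, padding=1)
        self.enh2 = nn.Conv2d(2 * C, 2 * C, 3, padding=1)
        self.down1 = nn.Conv2d(C, 2 * C, 1, stride=4)
        self.down2 = nn.Conv2d(2 * C, 2 * C, 1, stride=2)
        self.reduce3 = nn.Conv2d(4 * C, 2 * C, 1)
        # three copies of layer4, one per branch (step 3-4)
        self.branch = nn.ModuleList([copy.deepcopy(r.layer4) for _ in range(3)])

    def forward(self, x):
        l1 = self.layer1(self.stem(x))          # C x H x W
        l2 = self.layer2(l1)                    # 2C x H/2 x W/2
        l3 = self.layer3(l2)                    # 4C x H/4 x W/4
        r3 = self.reduce3(l3)                   # 2C x H/4 x W/4
        l13 = torch.cat([self.down1(self.enh1(l1)), r3], dim=1)  # 4C x H/4 x W/4
        l23 = torch.cat([self.down2(self.enh2(l2)), r3], dim=1)  # 4C x H/4 x W/4
        return [b(f) for b, f in zip(self.branch, (l3, l23, l13))]  # three global maps
```

With a 3K×K input (K = 128), l1 is 256×96×32, matching the H = 96, W = 32 of the embodiment, and each concatenated feature has the 1024 channels the layer4 copies expect.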
Step 4 comprises the following steps:
step 4-1, defining the experiment configuration: before training the network model on the training set, first define the model optimizer used to update parameters; set the batch size of the dynamic sampling of step 2 to P×Q, where P is the number of pedestrian identities per batch and Q the number of pedestrian images per identity; finally, set a learning-rate scheduler. The training set carries pedestrian identity labels, and the number of identity-label classes of the training set is denoted Y;
step 4-2, optimizing each global feature of step 3: each global feature is optimized with an improved triplet (ternary) loss for the feature metric, averaged over the global features. With hard-example mining within each batch, the loss L_triplet is:
L_triplet = (1/G) Σ_{g=1}^{G} Σ_{i=1}^{P} Σ_{a=1}^{Q} [ α + max_{p=1,…,Q} || f_a^{g,i} - f_p^{g,i} ||_2 - min_{j=1,…,P, j≠i; n=1,…,Q} || f_a^{g,i} - f_n^{g,j} ||_2 ]_+
where G denotes the number of global features (G = 3), f_a^{g,i} is the anchor sample of the g-th global feature of the i-th pedestrian identity, f_p^{g,i} a positive sample of the g-th global feature of the i-th pedestrian identity, and f_n^{g,j} a negative sample of the g-th global feature of the j-th pedestrian identity; α is a hyper-parameter controlling the gap between the inter-class and intra-class distances, 1.0 < α < 1.5, 1 ≤ i ≤ P, 1 ≤ a ≤ Q. A sketch of this loss follows below.
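A hedged PyTorch sketch of this loss for one global feature, under our batch-hard reading of the formula above (P identities × Q images per batch, Euclidean distances; names are illustrative):

```python
import torch

def hard_triplet_loss(feats, labels, alpha=1.2):
    # feats: (P*Q, d) global features of one branch; labels: (P*Q,) identity ids.
    dist = torch.cdist(feats, feats)                 # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = (dist * same).max(dim=1).values    # furthest same-identity sample
    inf = torch.full_like(dist, float("inf"))
    hardest_neg = torch.where(same, inf, dist).min(dim=1).values  # closest other identity
    # hinge with margin alpha, averaged over all anchors in the batch
    return torch.clamp(alpha + hardest_pos - hardest_neg, min=0).mean()
```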
step 4-3, optimizing each reduced component feature obtained in step 3-4 with an identity-classification cross-entropy loss. Each component feature uses its own linear classifier without bias term, component features and linear classifiers corresponding one to one. The identity-classification cross-entropy loss L_id is:
L_id = -(1/(N·P·Q)) Σ_{j=1}^{N} Σ_{q=1}^{P×Q} Σ_{r=1}^{Y} 1_{r=y} · log( softmax( fc_j(f_jq) )_r )
where fc_j denotes the j-th linear classifier, f_jq the vector of the j-th component feature f_j for the q-th pedestrian image of a batch, 1 ≤ j ≤ N, 1 ≤ q ≤ P×Q; N is the total number of linear classifiers, i.e. the number of component features; and 1_{r=y} is the one-hot coded vector whose length is the number of pedestrian identities and whose hot index r equals the ground-truth identity y of the pedestrian image;
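Correspondingly, a sketch of L_id with one bias-free linear classifier per component feature (names are illustrative; num_ids corresponds to Y):

```python
import torch.nn as nn
import torch.nn.functional as F

class PartClassifiers(nn.Module):
    def __init__(self, n_parts, feat_dim, num_ids):
        super().__init__()
        # one bias-free linear classifier per component feature (step 4-3)
        self.fc = nn.ModuleList([nn.Linear(feat_dim, num_ids, bias=False)
                                 for _ in range(n_parts)])

    def forward(self, part_feats, y):
        # part_feats: list of N tensors of shape (batch, feat_dim); y: (batch,) ids
        losses = [F.cross_entropy(fc(f), y) for fc, f in zip(self.fc, part_feats)]
        return sum(losses) / len(losses)   # average over the N component features
```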
step 4-4, adding the cross entropy loss function and the improved ternary loss function to obtain a loss function L used in final training, which is as follows:
L=Ltriplet+Lid
and 4-5, performing model training of the network model on the training set.
In steps 4-5, model training of the network model on the training set takes as input: the training set D; the pedestrian identity labels y; the iteration count T; a sampler S, an optimizer OPT and a learning-rate scheduler LR; the initialization parameters θ_0, where the subscript denotes the current iteration number, and the initial model Φ(x; θ_0). The output is the model Φ(x; θ_T). The specific training process comprises the following steps:
Step 4-5-1, load the model θ_0 pre-trained on the public data set ImageNet;
Step 4-5-2, the sampler S dynamically samples from the training set D, according to the configuration of step 4-1, N_b preprocessed pedestrian images {x_i}_{i=1}^{N_b}, where x_i denotes the i-th preprocessed pedestrian image and N_b = P×Q;
4-5-3, clearing the accumulated gradient by an optimizer OPT;
step 4-5-4, extract the global features and component features:
f_i = Φ(x_i; θ_t), i = 1, …, N_b
step 4-5-5, obtain the loss value with the loss function of step 4-4:
loss = L_triplet + L_id, computed on the extracted features and the labels y;
4-5-6, performing back propagation according to the loss value loss;
step 4-5-7, the optimizer OPT updates the model parameters θ_t, while the learning-rate scheduler LR updates the learning rate;
and 4-5-8, circularly and iteratively executing the steps 4-5-2 to 4-5-7 until the iteration number reaches T.
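The loop of steps 4-5-1 to 4-5-8 maps onto a conventional PyTorch training loop. A hedged sketch follows, reusing the sampler and triplet loss sketched earlier; the learning rate, the milestone list and the model API (returning global and component features separately, with part_heads computing L_id) are assumptions, while Adam with AMSGrad, P = 12, Q = 4 and the gamma = 0.1 schedule follow the embodiment detailed below:

```python
import torch

def train(model, sampler, dataset, labels, T, P=12, Q=4, device="cuda"):
    opt = torch.optim.Adam(model.parameters(), lr=3e-4, amsgrad=True)   # step 4-1 (lr assumed)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[40, 80, 120], gamma=0.1)
    model.train()
    for t in range(T):
        idxs = sampler.sample_batch(P, Q)                # step 4-5-2: N_b = P*Q images
        x = torch.stack([dataset[i] for i in idxs]).to(device)
        y = torch.tensor([labels[i] for i in idxs], device=device)
        opt.zero_grad()                                  # step 4-5-3: clear gradients
        globals_, parts = model(x)                       # step 4-5-4 (assumed model API)
        loss = sum(hard_triplet_loss(g, y) for g in globals_) / len(globals_) \
             + model.part_heads(parts, y)                # step 4-5-5: L_triplet + L_id
        loss.backward()                                  # step 4-5-6: back propagation
        opt.step()                                       # step 4-5-7: update parameters
        sched.step()                                     # ... and the learning rate
    return model
```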
The step 5 comprises the following steps:
step 5-1, load the network model trained in step 4 and use it to extract the pedestrian images of the test set, i.e. extract the depth representations of the query images in the query set and of the queried images in the queried set;
as defined in step 3-4, all global features and component features are spliced together; each feature of the test set is expressed as:
f(x) = Φ(x; θ_T), x ∈ N_test
where N_test denotes the test set and θ_T the parameter set at iteration count T;
the finally extracted depth representation of a pedestrian image is the splice of all its reduced global and component features, of size M×F.
step 5-2, to eliminate the deviation between the training set and the test set of the pedestrian image data set, the depth representation f(x) of the pedestrian image and the depth representation f(flip(x)) of the horizontally flipped pedestrian image are added, giving the test-set depth representation of the pedestrian image:
F(x) = f(x) + f(flip(x))
Step 5-3, normalize the depth representation F(x) of the pedestrian image obtained in step 5-2 by its two-norm, computed as:
||F||_2 = ( Σ_k F_k^2 )^{1/2}
The two-norm-normalized depth representation of the final test set is:
F̂(x) = F(x) / ||F(x)||_2
step 5-4, according to the depth characterization of the pedestrian images of the final test set, calculating the distance between each pedestrian image in the query set and each pedestrian image in the queried set, obtaining the query result of each pedestrian image in the query set, and realizing pedestrian re-identification;
step 5-4 comprises: denote the depth representation of each pedestrian image in the query set by q_i and the depth representation of each pedestrian image in the queried set by g_j; the distance matrix between the query set and the queried set is then:
M_ij = 1 - q_i · g_j, 1 ≤ i ≤ |N_query|, 1 ≤ j ≤ |N_gallery|
where N_gallery denotes the queried set, N_query the query set, and M_ij is the element in the i-th row and j-th column of the matrix, the cosine distance of the two-norm-normalized representations. Sorting, in ascending order, the distances between each query image and every pedestrian image in the queried set yields the identification result of each query image. A sketch of this retrieval procedure follows below.
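A sketch of the whole retrieval procedure of steps 5-1 to 5-4 (flip addition, two-norm normalization, cosine-distance ranking); the model is assumed to return the flattened M×F depth representation, and all names are illustrative:

```python
import torch

@torch.no_grad()
def retrieve(model, query, gallery):
    # query/gallery: (N, 3, 3K, K) preprocessed image tensors
    def embed(x):
        f = model(x)                                   # (N, M*F) depth representation
        f = f + model(torch.flip(x, dims=[3]))         # step 5-2: add flipped representation
        return f / f.norm(dim=1, keepdim=True)         # step 5-3: two-norm normalization
    q, g = embed(query), embed(gallery)
    dist = 1.0 - q @ g.t()                             # step 5-4: cosine-distance matrix
    return dist.argsort(dim=1)                         # ascending: best matches first
```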
Advantageous effects:
In the prior art, part-based depth models suffer semantic information loss because of the high coupling of the high-level features. With the method of the invention, a multilayer, multi-granularity depth model suppresses the semantic information loss of the high-level features, improving the pedestrian re-identification performance of part-based depth models; building the pedestrian depth representation on top of the data preprocessing, training the model and finally completing pedestrian re-identification achieves the best pedestrian re-identification performance at the present stage.
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic flow chart of a multi-layer multi-granularity pedestrian re-identification depth model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a multi-layer multi-granularity pedestrian re-identification depth model according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a convolutional network structure of an enhancement module, a downscaling module and a reduction module in a multilayer multi-granularity pedestrian re-identification depth model according to an embodiment of the present invention, where a maximum pooling layer module is a basic network pooling layer;
fig. 4 is a diagram of an example of query results in a multi-layer multi-granularity pedestrian re-identification depth model according to an embodiment of the present invention.
Fig. 5 is a schematic diagram defining the relative coordinates of the 16 points, including itself, surrounding an arbitrary point P(0,0).
Detailed Description
The embodiment of the invention provides a pedestrian image identification method based on a deep network model, which is applied to rapidly analyzing monitoring video data of public security places, automatically finding out specific pedestrians, remarkably improving monitoring quality and having important significance on city construction and social security.
As shown in fig. 1, a schematic workflow diagram of a pedestrian image recognition method based on a depth network model provided in the embodiment of the present invention is provided, and the embodiment discloses a pedestrian image recognition method based on a depth network model, including:
step 1, preprocessing pedestrian images in a pedestrian image data set, comprising: in the step, the pedestrian image data set comprises a training set and a test set, the test set comprises a query set and a queried set, and specifically, the pedestrian image data set used in the invention is a pedestrian image data set with any public standard, such as Market-1501, DukeMTMC-reiD and the like. The data enhancement comprises random horizontal turning and random erasing, and the pedestrian image can be obtained through manual labeling or a pedestrian detection algorithm. In this embodiment, through carrying out data preprocessing to the pedestrian image in the pedestrian image data set, the variety of sample can be effectively improved.
Step 2, after data preprocessing is completed, dynamic sampling is needed to be carried out, sample batches for training are obtained, the index list corresponding to the pedestrian images of each identity in a training set is counted, a dictionary set of the sample index lists which are not sampled is defined as US, a set of correct classification of a model is defined as TS, a set of wrong classification of the model is defined as FS, TS and FS are initialized to be empty, US is a dictionary set formed by all current training samples, and then a dynamic sampling algorithm is executed;
step 3, constructing a network model for pedestrian re-identification, namely constructing a depth representation of the pedestrian image, and comprising the following steps: extracting multilayer features through a backbone network model, using sub-modules to enhance and reduce the scale of low-level features, combining the high-level features and the low-level features to obtain the multilayer features, forming a multi-branch structure, and extracting the component features and the global features of each branch. In this step, the global features of the branches are used for representing corresponding pedestrian images, and the sub-modules include a lateral enhancement module, a downscaling module, a reduction module and a maximum pooling layer module. The schematic diagram of the network structure in the multilayer multi-granularity pedestrian re-identification depth model provided by the embodiment of the invention is shown in fig. 2. In fig. 2, an arrow with a reference number 0 indicates each layer of the backbone network, an arrow with a reference number 1 indicates an enhancement module, an arrow with a reference number 2 indicates a downscaling module, an arrow with a reference number 3 indicates a maximum pooling layer module, and an arrow with a reference number 4 indicates a reduction module.
Step 4, training the network model constructed in step 3, including: defining experiment related configuration, and optimizing model parameters of the network model, specifically, in this embodiment, optimizing the model parameters by combining a cross entropy loss function of identity classification and an improved ternary loss function for feature measurement. The loss function used in the final training is the sum of the average cross entropy loss function for each component and the average modified ternary loss function for each global feature.
And 5, re-identifying the pedestrian, comprising the following steps: under the condition that the identity of the pedestrians in the test set and the identity of the pedestrians in the training set are not repeated, extracting the depth representation of the query image through the network model trained in the step 4, normalizing the depth representation of the query image by using a two-norm method, and returning the identification result of each query image according to the similarity of each query image and the queried set based on the cosine distance. In the step, the pedestrian is re-identified under the condition that the pedestrian identity is not repeated, and the effectiveness of the model can be verified through the returned identification result.
In the modern society, the monitoring video data of public safety places are quickly analyzed, specific pedestrians are automatically found, the monitoring quality can be obviously improved, and the method has important significance for city construction and social safety. The invention provides a multilayer and multi-granularity pedestrian re-identification depth model and realizes the best pedestrian re-identification performance at the present stage.
In the following, the steps of the present invention are described in detail, and in the multi-layer and multi-granularity pedestrian re-identification depth model according to this embodiment, the step 1 includes:
step 1-1, adjust the size of the input pedestrian image by bicubic interpolation: resize any channel of pedestrian images of different sizes to 3K×K and, for any point P(0,0), define the relative coordinates of the 16 surrounding points including itself as P(r,c), -1 ≤ r ≤ 2, -1 ≤ c ≤ 2, as shown in fig. 5; r and c denote the offsets of the abscissa and ordinate respectively, a negative value meaning a leftward or upward offset and a positive value a rightward or downward offset, e.g. P(0,1) is the point adjacent to the right of P(0,0);
where P(0,0) represents the point in the original image closest to the mapping of pixel (x1, y1) of the target interpolation image, (u, v) is the coordinate offset of (x1, y1) from P(0,0), and (i, j) the absolute coordinate of P(0,0) in the original image; bicubic interpolation is the sum of the convolution interpolations of the 16 points, i.e. the interpolation function:
F(i+u, j+v) = Σ_{r=-1}^{2} Σ_{c=-1}^{2} f(i+r, j+c) · s(r-u) · s(c-v)
where x1 = i+u, y1 = j+v, f(i+r, j+c) is the pixel value of the corresponding point of the 16 points in the original image, and s(x) is the sampling kernel given in step 1-1 above, with formula coefficient a commonly taken as -0.5;
step 1-2, data enhancement by randomly horizontally flipping the pedestrian image, comprising: any channel of a 3K×K pedestrian image is flipped horizontally with probability P1, 0 < P1 < 1; in the actual experiments of this embodiment, P1 = 0.5. For a second arbitrary point (x2, y2) on the pedestrian image, the coordinates of its mirror point after the horizontal flip are:
(xf, yf) = (x2, K - y2 - 1)
where (x2, y2) are the coordinates of the second arbitrary point in the pedestrian image, 0 ≤ x2 ≤ 3K, 0 ≤ y2 ≤ K.
Step 1-3, data enhancement by randomly erasing the pedestrian image, including: for any channel of a 3K×K pedestrian image, with probability P2, 0 < P2 < 1 (P2 = 0.5 in the actual experiments of this embodiment), erase a random region of size h×w according to the random erasing function below, setting the pixel values of each channel inside the region to that channel's mean pixel value:
f(x3:x3+h, y3:y3+w) = m
where (x3, y3) are the coordinates of a third arbitrary point in the pedestrian image, 0 ≤ x3 ≤ 3K, 0 ≤ y3 ≤ K, and m is the mean pixel value of each channel of the pedestrian image.
Step 1-4, standardizing the data of each channel of the pedestrian image, comprising: normalize and standardize any channel of the 3K×K pedestrian image according to the normalization function of step 1-4 above:
f(x) = (x/255 - μ) / δ
where x is a pixel value of the pedestrian image obtained in step 1-3, 0 ≤ x ≤ 255, μ is the channel mean on the public data set ImageNet, and δ the channel standard deviation on ImageNet. This embodiment uses the per-channel statistics of the ImageNet data set: the RGB channel means are 0.485, 0.456, 0.406 and the standard deviations 0.229, 0.224, 0.225.
Step 2, after finishing data preprocessing, carrying out dynamic sampling to obtain sample batches for training;
step 2-1, counting an index list corresponding to a pedestrian image of each identity in a training set, defining a dictionary set of an index list of samples which are not sampled as US, a correctly classified set of models as TS, an incorrectly classified set of models as FS, initializing TS, FS as null, and US as a dictionary set formed by all current training samples;
step 2-2, performing dynamic sampling: in the current iteration round, draw from the training set a batch consisting of P pedestrians with Q images each, the P pedestrian identities being randomly sampled from the label list of the training set;
step 2-3, for each pedestrian identity obtained in step 2-2, preferentially sample Q images from the US set; if the US set is empty or fewer than Q images of that identity remain, complete the sample from the FS set, then from the TS set, and if the number is still insufficient, repeat this step with resampling until Q images are obtained;
step 2-4, after each iteration sampling, transferring the samples sampled in the current iteration round from the US set to the FS set, simultaneously transferring the samples correctly classified by the model from the FS set to the TS set, and transferring the samples wrongly classified by the model from the TS set to the FS set;
step 2-5, the step 2-3 and the step 2-4 are circulated until a batch with the size of P multiplied by Q is obtained by sampling;
after that, a depth characterization of the pedestrian image needs to be constructed through step 3, in the multilayer and multi-granularity pedestrian re-identification depth model according to the embodiment, the step 3 includes:
step 3-1, extracting multilayer features through a backbone network model. In this embodiment, the backbone network model is an existing basic deep convolutional neural network model, such as ResNet or VGG; features of different depths can be extracted through the backbone ResNet101 and comprise the first-layer depth feature l1, the second-layer depth feature l2, the third-layer depth feature l3 and the fourth-layer depth feature l4 (l4 is not shown in fig. 2). The sub-modules comprise an enhancement module, a downscaling module, a max-pooling module and a reduction module; specifically, in fig. 2, arrows labeled 0 indicate the layers of the backbone network, arrows labeled 1 the enhancement modules, arrows labeled 2 the downscaling modules, arrows labeled 3 the max-pooling modules, and arrows labeled 4 the reduction modules.
As shown in fig. 3, a schematic diagram of the convolutional network structure of the enhancement module, downscaling module and reduction module in the multilayer multi-granularity pedestrian re-identification depth model of the embodiment: Conv is a convolutional layer, the number after Conv is the convolution kernel size of that layer, BatchNorm2d is a batch normalization layer and ReLU a non-linear activation function layer; the max-pooling module is an existing common basic module and is not detailed here. A minimal sketch of this block pattern follows below.
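Following fig. 3's description, each of these sub-modules can be read as a Conv-BatchNorm2d-ReLU block; a minimal sketch (kernel sizes other than the stated 1×1 reduction are assumptions):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, stride=1):
    # Conv -> BatchNorm2d -> ReLU, the pattern shown in fig. 3
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, stride=stride, padding=kernel // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# e.g. a reduction module: 1x1 convolution halving l3's channels (4C -> 2C)
reduce3 = conv_block(1024, 512, kernel=1)
```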
In this embodiment, step 3-2 comprises: enhance the representation capability of the first-layer depth feature l1 and the second-layer depth feature l2 with the enhancement modules, and reduce l1 and l2 with two downscaling modules to the size of the reduced third-layer depth feature l3.
When the first-layer depth feature l1 has size C×H×W (in this embodiment W is generally K/4 and H is generally 3W), then by the backbone network model the second-layer depth feature l2 has size 2C×H/2×W/2 and the third-layer depth feature l3 has size 4C×H/4×W/4; the reduced l3 has size 2C×H/4×W/4, where C is the number of channels, H the height of l1 (96 in this example) and W the width of l1 (32 in this example).
Step 3-3, after the two downscaling modules, the first-layer depth feature l1 and second-layer depth feature l2 are reduced to the size of the reduced third-layer depth feature l3, i.e. 2C×H/4×W/4;
splicing the downscaled first-layer depth feature l1 and the reduced third-layer depth feature l3 along the channel dimension yields a depth feature of size 4C×H/4×W/4, consistent with the size of l3 before reduction: the first multilayer depth feature l13;
splicing the downscaled second-layer depth feature l2 and the reduced third-layer depth feature l3 along the channel dimension likewise yields a depth feature of size 4C×H/4×W/4, consistent with l3 before reduction: the second multilayer depth feature l23.
Step 3-4, the multilayer depth features l13 and l23 obtained in step 3-3 and the third-layer depth feature l3 of the backbone network are each fed into a copy of the network layer corresponding to the backbone's fourth-layer depth feature l4, forming the multi-branch structure; the global features comprise the first global feature l4-1, the second global feature l4-2 and the third global feature l4-3.
The global features are segmented into component features: the first global feature l4-1 is cut into first component features at granularity 1, the second global feature l4-2 into second component features at granularity 2, and the third global feature l4-3 into third component features at granularity 3;
the reduction module further reduces the channel count of the global features and component features to F, and their resolutions are pooled to 1×1; the reduction module is a 1×1 convolution layer with a shared kernel, each reduced global or component feature has size F×1×1, and the set of reduced component features is denoted S; specifically, in this embodiment F = 256.
All reduced global features and component features are spliced to obtain the constructed depth representation of the pedestrian image, of size M×F, where M is the total number of global and component features; specifically, in this embodiment M = 9 (three global features plus 1+2+3 component features). A sketch of the slicing, pooling and reduction follows below.
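The slicing, pooling and reduction of step 3-4 can be sketched as follows; slicing along the height axis and the 2048-channel branch output are assumptions consistent with a ResNet backbone, while granularities 1/2/3, F = 256 and M = 9 come from the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def branch_features(feat, granularity, reducer):
    # feat: (N, C, H, W) branch output; reducer: shared 1x1 conv mapping C -> F channels.
    maps = [feat] + list(feat.chunk(granularity, dim=2))   # global map + height strips
    pooled = [F.adaptive_max_pool2d(m, 1) for m in maps]   # pool resolutions to 1x1
    return [reducer(p).flatten(1) for p in pooled]         # each reduced to (N, F)

# Usage: three branches with granularities 1, 2, 3 and a shared reducer.
reducer = nn.Conv2d(2048, 256, 1)                           # F = 256
branches = [torch.randn(2, 2048, 24, 8) for _ in range(3)]  # stand-ins for l4-1, l4-2, l4-3
feats = [f for g, b in zip((1, 2, 3), branches)
         for f in branch_features(b, g, reducer)]
assert len(feats) == 9                                      # M = 9 features of 256 dims each
```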
In the multilayer and multi-granularity pedestrian re-identification depth model according to this embodiment, the step 4 includes:
step 4-1, defining the experiment configuration, comprising: before training the pedestrian re-identification model on the training set, first define the model optimizer used to update parameters; specifically, this embodiment uses the Adam optimizer with the AMSGrad method, loading the parameters of the pedestrian re-identification model constructed in step 3. The batch size of the input images is set to P×Q, where P is the number of pedestrian identities per batch and Q the number of pedestrian images per identity; specifically, in this embodiment P = 12 and Q = 4. Finally, a learning-rate scheduler is set. The training set is contained in an open pedestrian image data set and carries pedestrian identity labels; the number of identity-label classes of the training set is denoted Y. Specifically, this embodiment uses a multi-step learning-rate scheduler (MultiStepLR): when training reaches a preset iteration point, the learning rate is multiplied by gamma; in this embodiment gamma = 0.1, and an iteration point is preset every 40 iterations.
Step 4-2, optimizing each global feature of step 3, including: each global feature is optimized with the improved triplet loss for the feature metric, averaged over the global features; with hard-example mining within each batch, the improved triplet loss is:
L_triplet = (1/G) Σ_{g=1}^{G} Σ_{i=1}^{P} Σ_{a=1}^{Q} [ α + max_{p=1,…,Q} || f_a^{g,i} - f_p^{g,i} ||_2 - min_{j=1,…,P, j≠i; n=1,…,Q} || f_a^{g,i} - f_n^{g,j} ||_2 ]_+
where G denotes the number of global features (G = 3), f_a^{g,i} is the anchor sample of the g-th global feature of the i-th pedestrian identity, f_p^{g,i} a positive sample of the g-th global feature of the i-th pedestrian identity, and f_n^{g,j} a negative sample of the g-th global feature of the j-th pedestrian identity; α is a hyper-parameter controlling the gap between the inter-class and intra-class distances, 1.0 < α < 1.5, 1 ≤ i ≤ P, 1 ≤ a ≤ Q; in this embodiment α = 1.2.
4-3, optimize each reduced component feature obtained in step 3-4 with the identity-classification cross-entropy loss. In this embodiment, because identity classification must keep the output dimensionality consistent with the number Y of pedestrian identity labels, a linear layer without bias term is added for each component feature, mapping the F-dimensional component feature to an output of dimension Y; each component feature uses its own bias-free linear classifier, component features and classifiers corresponding one to one. The identity-classification cross-entropy loss is:
L_id = -(1/(N·P·Q)) Σ_{j=1}^{N} Σ_{q=1}^{P×Q} Σ_{r=1}^{Y} 1_{r=y} · log( softmax( fc_j(f_jq) )_r )
where fc_j denotes the j-th linear classifier, f_jq the vector of the j-th component feature f_j for the q-th pedestrian image of a batch, 1 ≤ j ≤ N, 1 ≤ q ≤ P×Q (P×Q being the batch size); N is the total number of linear classifiers, i.e. the number of component features, as described in step 3-1; and 1_{r=y} is the one-hot coded vector whose length is the number of pedestrian identities and whose hot index r equals the ground-truth identity y of the pedestrian image.
Step 4-4, adding the average cross entropy loss function of each component feature and the average improved ternary loss function of each global feature to obtain a loss function used in final training, as follows:
L=Ltriplet+Lid
and 4-5, performing model training of the network model on the training set. The specific training algorithm is as follows:
inputting: the training set D; the pedestrian identity labels y; the iteration count T; a sampler S, an optimizer OPT, a learning-rate scheduler LR; the initialization parameters θ_0, the subscript denoting the current iteration number, and the initial model Φ(x; θ_0);
outputting: the model Φ(x; θ_T);
Step 4-5-1, load the model θ_0 pre-trained on the public data set ImageNet;
Step 4-5-2, the sampler S dynamically samples from the training set D, according to the configuration of step 4-1, N_b preprocessed pedestrian images {x_i}_{i=1}^{N_b}, where x_i denotes the i-th preprocessed pedestrian image and N_b = P×Q;
4-5-3, clearing the accumulated gradient by an optimizer OPT;
step 4-5-4, extract the global features and component features:
f_i = Φ(x_i; θ_t), i = 1, …, N_b
step 4-5-5, obtain the loss value with the loss function of step 4-4:
loss = L_triplet + L_id, computed on the extracted features and the labels y;
4-5-6, performing back propagation according to the loss value loss;
step 4-5-7, the optimizer OPT updates the model parameters θ_t, while the learning-rate scheduler LR updates the learning rate;
step 4-5-8, circularly and iteratively executing the step 4-5-2 to the step 4-5-7 until the iteration number reaches T;
where the parameter subscript t in the model output by the training algorithm denotes the current iteration number, and the batch size N_b = P×Q.
In the multilayer and multi-granularity pedestrian re-identification depth model according to this embodiment, the step 5 includes:
and 5-1, loading the network model trained in the step 4, and extracting the depth characterization of the pedestrian image in a test set, wherein the test set comprises a query set and a queried set, namely extracting the query image and the depth characterization of the queried image by using the model.
As defined in steps 3-4, all global features and component features in the test set are stitched together, each feature of the test set being represented as:
Figure GDA0002976432950000163
wherein N istestRepresents the test set, θTRepresenting a parameter set when the iteration number is T;
the depth characterization of the final extracted pedestrian image is as follows:
Figure GDA0002976432950000164
and 5-2, eliminating the deviation between a training set and a test set in the enhanced pedestrian image data set, remarkably changing data distribution due to random horizontal inversion of the training set, and representing the depth of the pedestrian image by considering the inverted pedestrian image during specific test
Figure GDA0002976432950000165
And depth characterization of the flipped pedestrian image
Figure GDA0002976432950000166
Additive, pedestrian depth characterization as test set
Figure GDA0002976432950000167
Specifically, in this embodiment, the flipping function is shown as step 1-2.
Step 5-3, normalize the pedestrian depth representation F(x) obtained in step 5-2 by its two-norm, computed as:
||F||_2 = ( Σ_k F_k^2 )^{1/2}
The two-norm-normalized pedestrian depth representation of the final test set is:
F̂(x) = F(x) / ||F(x)||_2
step 5-4, according to the pedestrian depth representation of the final test set, calculating the distance between each pedestrian image in the query set and each pedestrian image in the queried set, obtaining the query result of each pedestrian image in the query set, and realizing pedestrian re-identification;
if the depth of each pedestrian image in the query set is characterized as
Figure GDA0002976432950000173
The depth of each pedestrian image in the queried set is characterized as
Figure GDA0002976432950000174
The distance matrix between the query set and the queried set is:
Figure GDA0002976432950000175
wherein N isgalleryRepresenting a queried set, NqueryRepresenting a set of queries;
the distances between each query image and each pedestrian image in all the queried sets are ranked according to the sequence from small to large, the smaller the distance between the pedestrian image in the queried set and the query image is, the higher the possibility that the pedestrian is the same is, and therefore the identification result of each query image can be obtained, and the first ten query results are generally taken for evaluation.
As shown in fig. 4, an example of query results of the multilayer multi-granularity pedestrian re-identification depth model of the embodiment, where ✓ denotes a correct retrieval and ✗ an incorrect retrieval: in each example query, the first row is the query result of the invention and the second row the query result of the classical part model PCB. The method of the invention is clearly better than the classical part model PCB and achieves the best pedestrian re-identification performance at the present stage.
According to the technical scheme, the embodiment of the invention provides a multilayer multi-granularity pedestrian re-identification depth model, comprising: step 1, preprocessing the pedestrian images in a pedestrian image data set, including adjusting the size of the pedestrian image, performing data enhancement, and normalizing and standardizing the enhanced images, the data set comprising a training set, a query set and a queried set; step 2, dynamically sampling the preprocessed training set; step 3, constructing a network model for pedestrian re-identification, i.e. constructing the depth representation of the pedestrian image, by extracting multilayer features through a backbone network model, enhancing and fusing them with sub-modules to form a multi-branch structure, and extracting the component features and global features of each branch; step 4, training the network model constructed in step 3, by defining the experiment configuration and optimizing the model parameters; and step 5, pedestrian re-identification: extracting the depth representation of each query image with the network model trained in step 4, normalizing it with the two-norm, and returning the identification result of each query image according to its cosine-distance similarity to the queried set.
In the prior art, part-based depth models often only segment the highly coupled high-level features of the backbone network, and the segmentation loses semantic information beneficial to re-identification, making the pedestrian re-identification performance of part-based depth models unstable.
By adopting the method, the problem that semantic information is lost after high-level feature segmentation is solved through the depth features based on multiple layers and multiple granularities, so that the pedestrian re-recognition performance of the depth model based on the components is improved, the pedestrian depth characterization is constructed based on data preprocessing, the model is trained, the pedestrian re-recognition is finally completed, and the best pedestrian re-recognition performance in the current stage is realized.
In particular implementations, the present invention also provides a computer storage medium, where the computer storage medium may store a program that, when executed, may include some or all of the steps in embodiments of a multi-layered, multi-granular pedestrian re-identification depth model provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The present invention provides a pedestrian image recognition method based on a deep network model, and the method and the way for implementing the technical solution are many, the above description is only the preferred embodiment of the present invention, it should be noted that, for those skilled in the art, without departing from the principle of the present invention, several improvements and embellishments can be made, and these improvements and embellishments should also be regarded as the protection scope of the present invention. All the components not specified in the present embodiment can be realized by the prior art.

Claims (3)

1. A pedestrian image identification method based on a depth network model is characterized by comprising the following steps:
step 1, preprocessing the pedestrian images in a pedestrian image data set, wherein the pedestrian image data set comprises a training set and a testing set, the testing set comprises a query set and a queried set, the pedestrian identities of the testing set do not overlap those of the training set, and the query set and the queried set share the same pedestrian identities;
step 2, dynamically sampling the preprocessed training set;
step 3, constructing a network model for pedestrian re-identification;
step 4, training the network model constructed in the step 3;
step 5, re-identifying the pedestrian;
the step 1 comprises the following steps:
step 1-1, adjusting the size of the input pedestrian image by bicubic interpolation: for any channel of pedestrian images of different sizes, the pedestrian image is resized to 3K × K; for any point P(0,0) in the image, the relative coordinates of the 16 surrounding points (the point itself included) are defined as P(r, c), with −1 ≤ r ≤ 2 and −1 ≤ c ≤ 2; r and c respectively denote the offset of the abscissa and of the ordinate, a negative value denoting a leftward or upward offset and a positive value a rightward or downward offset;
wherein P(0,0) denotes the mapping point in the original image closest to the pixel point (x1, y1) of the target interpolation image; denoting the coordinate offset of P(0,0) as (u, v) and the absolute coordinate of P(0,0) in the original image as (i, j), the bicubic interpolation is the sum of the convolution interpolations of the above 16 points, i.e. the interpolation function F(i + u, j + v):

F(i + u, j + v) = Σ_{r=−1..2} Σ_{c=−1..2} f(i + r, j + c) · S(r − u) · S(c − v)

wherein x1 = i + u, y1 = j + v, f(i + r, j + c) denotes the pixel value of any one of the 16 points in the original image, and S(x) is the sampling formula:

S(x) = (a + 2)|x|³ − (a + 3)|x|² + 1,   for |x| ≤ 1
S(x) = a|x|³ − 5a|x|² + 8a|x| − 4a,     for 1 < |x| < 2
S(x) = 0,                               otherwise

wherein a is the kernel coefficient (a = −0.5 is a common choice);
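For illustration only, the interpolation of step 1-1 can be sketched in a few lines of Python; the kernel coefficient a = −0.5 and the toy image are assumptions, not part of the claim:

```python
# A minimal sketch of the bicubic interpolation of step 1-1 (a = -0.5 assumed).
import numpy as np

def S(x, a=-0.5):
    """Sampling kernel S(x) of step 1-1."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def F(img, i, j, u, v):
    """F(i+u, j+v): convolution sum over the 16 points f(i+r, j+c), r, c in [-1, 2]."""
    return sum(img[i + r, j + c] * S(r - u) * S(c - v)
               for r in range(-1, 3) for c in range(-1, 3))

img = np.arange(49, dtype=float).reshape(7, 7)   # toy single-channel image
print(F(img, i=3, j=3, u=0.25, v=0.5))           # interpolated value at (3.25, 3.5)
```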
step 1-2, randomly flipping the pedestrian image horizontally: for any channel of a pedestrian image of size 3K × K, the image is flipped horizontally at random with probability P1, 0 < P1 < 1; for a second arbitrary point (x2, y2) on the pedestrian image, the coordinates (xf, yf) of its symmetric point after the horizontal flip are:

(xf, yf) = (x2, K − y2 − 1)

wherein (x2, y2) is the coordinate of the second arbitrary point in the pedestrian image, 0 ≤ x2 ≤ 3K, 0 ≤ y2 ≤ K;
step 1-3, randomly erasing the pedestrian image: for any channel of a pedestrian image of size 3K × K, a random region of size h × w is erased with probability P2, 0 < P2 < 1, according to the following erasing function f(·), all pixel values of each channel within the region being set to the pixel-value mean of that channel:

f(x3 : x3 + h, y3 : y3 + w) = m

wherein (x3, y3) is the coordinate of a third arbitrary point in the pedestrian image, 0 ≤ x3 ≤ 3K, 0 ≤ y3 ≤ K, and m is the pixel-value mean of each channel of the pedestrian image;
step 1-4, performing data standardization on the data of each channel of the pedestrian image: any channel of a pedestrian image of size 3K × K is normalized and standardized according to the following function f(x):

f(x) = (x / 255 − μ) / δ

wherein x is the pixel value of any point under each channel of the pedestrian image obtained in step 1-3, 0 ≤ x ≤ 255, μ is the per-channel mean of the public data set ImageNet, and δ is the per-channel standard deviation of the public data set ImageNet;
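Steps 1-1 to 1-4 map directly onto a standard augmentation pipeline; a minimal torchvision sketch follows, where K = 128 (so 3K × K = 384 × 128) and P1 = P2 = 0.5 are assumed values not fixed by the claim:

```python
# A sketch of the preprocessing of steps 1-1 to 1-4 with torchvision.
import torchvision.transforms as T

IMAGENET_MEAN = [0.485, 0.456, 0.406]   # per-channel mean (mu) of ImageNet
IMAGENET_STD  = [0.229, 0.224, 0.225]   # per-channel std (delta) of ImageNet

train_transform = T.Compose([
    T.Resize((384, 128), interpolation=T.InterpolationMode.BICUBIC),  # step 1-1
    T.RandomHorizontalFlip(p=0.5),                                    # step 1-2
    T.ToTensor(),                              # scales pixel values to [0, 1]
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),                         # step 1-4
    # step 1-3: after Normalize, the value 0 corresponds to the channel mean,
    # approximating the claim's "erase with the per-channel pixel mean".
    T.RandomErasing(p=0.5, value=0),
])
```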
the step 2 comprises the following steps:
step 2-1, counting the index list corresponding to the pedestrian images of each identity in the training set, the pedestrian images in the training set being the training samples; defining US as the dictionary of index lists of unsampled samples, TS as the set of samples the model classifies correctly, and FS as the set of samples the model classifies wrongly; TS and FS are initialized empty, and US is initialized as the dictionary formed by all current training samples;
step 2-2, performing dynamic sampling: under the current iteration round, a batch consisting of P pedestrian identities with Q images each is acquired from the training set, the P pedestrian identities being randomly sampled from the label list of the training set;
step 2-3, for each pedestrian identity acquired in step 2-2, preferentially sampling the Q images from the US set; if the US set is empty or holds fewer than Q remaining images of the corresponding identity, complementing by sampling from the FS set; if the number is still insufficient, complementing from the TS set; and if it is still insufficient, looping step 2-3 and sampling repeatedly until Q images are acquired;
step 2-4, after each iteration sampling, transferring the samples sampled in the current iteration round from the US set to the FS set, simultaneously transferring the samples correctly classified by the model from the FS set to the TS set, and transferring the samples wrongly classified by the model from the TS set to the FS set;
step 2-5, the step 2-3 and the step 2-4 are circulated until a batch with the size of P multiplied by Q is obtained by sampling;
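A simplified Python sketch of the US/FS/TS bookkeeping of steps 2-1 to 2-5 is given below; the class and method names are hypothetical, and each identity is assumed to own at least one image:

```python
# A simplified sketch of the dynamic sampling of steps 2-1 to 2-5.
import random
from collections import defaultdict

class DynamicSampler:
    def __init__(self, labels, P, Q):
        self.P, self.Q, self.ids = P, Q, sorted(set(labels))
        self.US = defaultdict(list)          # step 2-1: unsampled indices
        for idx, pid in enumerate(labels):
            self.US[pid].append(idx)
        self.FS = defaultdict(list)          # wrongly classified samples
        self.TS = defaultdict(list)          # correctly classified samples

    def sample_batch(self):
        batch = []
        for pid in random.sample(self.ids, self.P):                  # step 2-2
            picked = []
            for pool in (self.US[pid], self.FS[pid], self.TS[pid]):  # step 2-3
                while pool and len(picked) < self.Q:
                    picked.append(pool.pop(random.randrange(len(pool))))
            while len(picked) < self.Q:      # step 2-3: resample if still short
                picked.append(random.choice(picked))
            self.FS[pid].extend(picked)      # step 2-4: sampled indices -> FS
            batch.extend(picked)
        return batch                         # step 2-5: one P x Q batch

    def refile(self, pid, idx, correct):
        """Step 2-4: move a sample between FS and TS by its prediction result."""
        src, dst = (self.FS, self.TS) if correct else (self.TS, self.FS)
        if idx in src[pid]:
            src[pid].remove(idx)
            dst[pid].append(idx)
```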
the step 3 comprises the following steps:
step 3-1, constructing a network model for pedestrian re-identification, wherein the network model comprises a backbone network model and sub-modules;
extracting multilayer features through a backbone network model, namely extracting features of different depths, the features of different depths comprising: a first-layer depth feature l1, a second-layer depth feature l2, a third-layer depth feature l3 and a fourth-layer depth feature l4; the backbone network model adopts ResNet, a classical classification network for the ImageNet data set;
the sub-modules comprise an enhancement module, a downscaling module, a reduction module and a maximum pooling layer module; the first-layer depth feature l1 and the second-layer depth feature l2 are defined as low-level features, while the third-layer depth feature l3 and the fourth-layer depth feature l4 are high-level features;
when the first-layer depth feature l1 has size C × H × W, it follows from the backbone network model that the second-layer depth feature l2 has size 2C × H/2 × W/2 and the third-layer depth feature l3 has size 4C × H/4 × W/4, wherein C is the number of channels, H the height and W the width of the first-layer depth feature l1;
step 3-2, enhancing the first-layer depth feature l1 and the second-layer depth feature l2 with two enhancement modules respectively, their sizes remaining unchanged; then reducing the sizes of the enhanced l1 and l2 to 2C × H/4 × W/4 through the two downscaling modules respectively;
step 3-3, reducing the number of channels of the third-layer depth feature l3 to half of the original through the reduction module, i.e. reducing its size to 2C × H/4 × W/4;
splicing the downscaled first-layer depth feature l1 and the reduced third-layer depth feature l3 along the channel dimension to obtain a first multilayer depth feature l13 of size 4C × H/4 × W/4;
splicing the downscaled second-layer depth feature l2 and the reduced third-layer depth feature l3 along the channel dimension to obtain a second multilayer depth feature l23 of size 4C × H/4 × W/4 (the two spliced 2C-channel features yield 4C channels, matching the input of the network layer corresponding to l4);
step 3-4, feeding the multilayer depth features l13 and l23 obtained in step 3-3, together with the third-layer depth feature l3 of the backbone network model, into separate copies of the network layer corresponding to the fourth-layer depth feature l4 of the backbone network model, forming a multi-branch structure; the global features comprise a first global feature l4-1, a second global feature l4-2 and a third global feature l4-3; the first global feature l4-1 is obtained by feeding the third-layer depth feature l3 into the network layer corresponding to l4 and is thus equivalent to the fourth-layer depth feature of the backbone network model; the second global feature l4-2 is obtained by feeding l23 into the network layer corresponding to l4; and the third global feature l4-3 is obtained by feeding l13 into the network layer corresponding to l4;
segmenting the global features into component features, comprising: cutting the first global feature l4-1 into first component features with granularity 1, the second global feature l4-2 into second component features with granularity 2, and the third global feature l4-3 into third component features with granularity 3;
pooling the resolutions of the global features and the component features to 1 × 1 with the maximum pooling layer module, and further reducing the numbers of channels of the global features and the component features to F with the reduction module, the reduction module here being a 1 × 1 convolution layer with a shared kernel; each reduced global feature and component feature has size F × 1 × 1, and the set formed by the reduced component features is denoted S;
splicing all the reduced global features and component features to obtain the constructed depth characterization of the pedestrian image, of size M × F, wherein M is the total number of global features and component features;
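Under the assumption that the backbone is ResNet-50 (so C = 256) and that the enhancement module is a plain conv-BN-ReLU block (the claim does not fix its internals), the branch structure of steps 3-1 to 3-4 can be sketched in PyTorch as follows:

```python
# A PyTorch sketch of the multi-branch model of step 3 (assumptions noted above).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

def enhance(c):
    # Assumed enhancement module: keeps the feature size unchanged.
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU())

class MultiBranchReID(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        r = resnet50(weights="IMAGENET1K_V1")        # pre-trained backbone, C = 256
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2, self.layer3 = r.layer1, r.layer2, r.layer3
        self.b1, self.b2, self.b3 = (copy.deepcopy(r.layer4) for _ in range(3))
        self.enh1, self.enh2 = enhance(256), enhance(512)
        self.down1 = nn.Conv2d(256, 512, 3, stride=4, padding=1)  # l1 -> 2C x H/4 x W/4
        self.down2 = nn.Conv2d(512, 512, 3, stride=2, padding=1)  # l2 -> 2C x H/4 x W/4
        self.red3 = nn.Conv2d(1024, 512, 1)                       # halve l3 channels
        self.reduce = nn.Conv2d(2048, feat_dim, 1)                # shared 1x1 reduction to F

    def forward(self, x):
        l1 = self.layer1(self.stem(x))
        l2 = self.layer2(l1)
        l3 = self.layer3(l2)
        l3r = self.red3(l3)
        l13 = torch.cat([self.down1(self.enh1(l1)), l3r], 1)      # 4C x H/4 x W/4
        l23 = torch.cat([self.down2(self.enh2(l2)), l3r], 1)      # 4C x H/4 x W/4
        feats = []
        # branches: l3 -> l4-1 (1 part), l23 -> l4-2 (2 parts), l13 -> l4-3 (3 parts)
        for branch, inp, parts in ((self.b1, l3, 1), (self.b2, l23, 2), (self.b3, l13, 3)):
            g = branch(inp)
            feats.append(self.reduce(F.adaptive_max_pool2d(g, 1)).flatten(1))  # global
            stripes = F.adaptive_max_pool2d(g, (parts, 1))        # horizontal stripes
            for k in range(parts):
                feats.append(self.reduce(stripes[:, :, k:k+1, :]).flatten(1))  # parts
        return feats   # M = 9 reduced features, each of dimension feat_dim (F)
```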
step 4 comprises the following steps:
step 4-1, defining the experiment-related configuration: before training the network model on the training set, first defining the model optimizer used for updating parameters; setting the batch size of the dynamic sampling of step 2 to P × Q, wherein P denotes the number of pedestrian identities in each batch and Q the number of pedestrian images per identity; and finally setting a learning-rate scheduler; the training set carries pedestrian identity labels, and the number of identity classes of the training set is denoted Y;
step 4-2, optimizing each global feature of step 3 respectively: every global feature is optimized with an improved triplet loss function L_triplet for the feature metric:

L_triplet = Σ_{g=1..G} Σ_{i=1..P} Σ_{a=1..Q} [ α + max_{p=1..Q} ||f_a^(g,i) − f_p^(g,i)||_2 − min_{j=1..P, j≠i; n=1..Q} ||f_a^(g,i) − f_n^(g,j)||_2 ]_+

wherein G denotes the number of global features; f_a^(g,i) denotes an anchor sample of the g-th global feature of the i-th pedestrian identity, f_p^(g,i) a positive sample of the g-th global feature of the i-th pedestrian identity, and f_n^(g,j) a negative sample of the g-th global feature of the j-th pedestrian identity; α is a hyper-parameter controlling the gap between the inter-class and the intra-class distance, 1.0 < α < 1.5, 1 ≤ i ≤ P, 1 ≤ a ≤ Q;
step 4-3, optimizing each reduced component feature obtained in step 3-4 with a cross-entropy loss function for identity classification, each component feature using a bias-free linear classifier, the component features corresponding to the linear classifiers one to one; the identity-classification cross-entropy loss function L_id is:

L_id = − Σ_{j=1..N} Σ_{q=1..P×Q} Σ_{r=1..Y} 1_{r=y} · log softmax(fc_j(f_jq))_r

wherein fc_j denotes the j-th linear classifier; f_jq denotes the vector of the j-th component feature f_j for the q-th pedestrian image in a batch, 1 ≤ j ≤ N, 1 ≤ q ≤ P×Q; N denotes the total number of linear classifiers, i.e. the number of component features; 1_{r=y} denotes a one-hot coded vector whose length equals the number of pedestrian identities and whose hot index r equals the identity ground truth y of the pedestrian image;
step 4-4, adding the cross-entropy loss function and the improved triplet loss function to obtain the loss function L used in the final training:

L = L_triplet + L_id
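A compact PyTorch sketch of this combined objective — a batch-hard triplet term over the global features plus bias-free linear classifiers over the part features — follows; the margin value 1.2 is an assumption within the claimed range (1.0, 1.5):

```python
# A sketch of the loss of steps 4-2 to 4-4 (alpha = 1.2 assumed).
import torch
import torch.nn.functional as F

def triplet_hard(feats, pids, alpha=1.2):
    """Improved triplet loss: hardest positive minus hardest negative."""
    dist = torch.cdist(feats, feats)                 # pairwise two-norm distances
    same = pids.unsqueeze(0) == pids.unsqueeze(1)    # same-identity mask
    d_pos = dist.masked_fill(~same, float('-inf')).max(1).values  # max intra-class
    d_neg = dist.masked_fill(same, float('inf')).min(1).values    # min inter-class
    return F.relu(alpha + d_pos - d_neg).mean()

def total_loss(global_feats, part_feats, classifiers, pids, alpha=1.2):
    """L = L_triplet + L_id (steps 4-2 to 4-4)."""
    l_tri = sum(triplet_hard(g, pids, alpha) for g in global_feats)
    l_id = sum(F.cross_entropy(fc(f), pids)          # fc: nn.Linear(F_dim, Y, bias=False)
               for fc, f in zip(classifiers, part_feats))
    return l_tri + l_id
```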
step 4-5, performing model training of the network model on the training set;
in step 4-5, when model training of the network model is performed on the training set, the inputs are: the training set D; the pedestrian identity labels y; the iteration number T; the sampler S; the optimizer OPT; the learning-rate scheduler LR; the initialization parameters θ_0, the subscript 0 denoting the current iteration number; and the initial model Φ(x; θ_0). The output is the model Φ(x; θ_T). The specific training process comprises the following steps:
step 4-5-1, loading the parameters θ_0 pre-trained on the public data set ImageNet;
step 4-5-2, the sampler S dynamically samples N_b preprocessed pedestrian images {x_i | i = 1, …, N_b} from the training set D according to the configuration of step 4-1, wherein x_i denotes the i-th preprocessed pedestrian image and N_b = P × Q;
step 4-5-3, clearing the accumulated gradients through the optimizer OPT;
step 4-5-4, extracting the global features and the component features: f_i = Φ(x_i; θ_t), wherein θ_t denotes the model parameters at the current iteration t and f_i denotes the spliced global and component features of x_i;
step 4-5-5, obtaining the loss value with the loss function of step 4-4: loss = L(Φ(x_i; θ_t), y);
step 4-5-6, performing back propagation according to the loss value loss;
step 4-5-7, the optimizer OPT updates the model parameters θ_t, while the learning-rate scheduler LR updates the learning rate;
step 4-5-8, iteratively executing steps 4-5-2 to 4-5-7 in a loop until the iteration number reaches T.
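Condensed into code, the loop of steps 4-5-1 to 4-5-8 reads as below; the Adam optimizer, the step scheduler, and the model interface (returning global and part features and holding the classifiers) are assumptions, with `total_loss` and the sampler following the earlier sketches:

```python
# A condensed sketch of the training loop of steps 4-5-1 to 4-5-8.
import torch

def train(model, dataset, sampler, T, device="cuda"):
    opt = torch.optim.Adam(model.parameters(), lr=3.5e-4)        # optimizer OPT
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=40)   # scheduler LR
    model.to(device).train()                 # step 4-5-1: ImageNet-pretrained weights
    for t in range(T):                       # step 4-5-8: iterate T times
        idxs = sampler.sample_batch()        # step 4-5-2: N_b = P x Q images
        imgs = torch.stack([dataset[i][0] for i in idxs]).to(device)
        pids = torch.tensor([dataset[i][1] for i in idxs], device=device)
        opt.zero_grad()                      # step 4-5-3: clear accumulated gradients
        g_feats, p_feats = model(imgs)       # step 4-5-4: global + component features
        loss = total_loss(g_feats, p_feats, model.classifiers, pids)  # step 4-5-5
        loss.backward()                      # step 4-5-6: back propagation
        opt.step(); sched.step()             # step 4-5-7: update theta_t and lr
    return model
```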
2. The method of claim 1, wherein step 5 comprises:
step 5-1, loading the network model trained in step 4 and extracting, with the network model, the depth characterizations of the pedestrian images in the test set, namely the depth characterizations of the query images in the query set and of the queried images in the queried set;
all the global features and component features are spliced as defined in step 3-4; each feature of the test set is expressed as

f_x = Φ(x; θ_T), x ∈ N_test

wherein N_test denotes the test set and θ_T denotes the parameter set when the iteration number is T; the finally extracted depth characterization of a pedestrian image x is f_x;
step 5-2, eliminating the deviation between the training set and the test set of the pedestrian image data set: the depth characterization f_x of the pedestrian image and the depth characterization f_x' of its horizontally flipped copy are added, giving the test-set depth characterization of the pedestrian image:

f_x^test = f_x + f_x';
step 5-3, normalizing the depth characterization f_x^test obtained in step 5-2 with the two-norm, the two-norm being calculated according to the following formula:

||f||_2 = ( Σ_{k=1..M×F} f_k² )^(1/2)

the final test-set depth characterization of the pedestrian image, normalized with the two-norm, being:

f_x^norm = f_x^test / ||f_x^test||_2;
step 5-4, calculating, from the final test-set depth characterizations, the distance between each pedestrian image in the query set and each pedestrian image in the queried set, obtaining the query result of each pedestrian image in the query set and thereby realizing pedestrian re-identification.
3. The method of claim 2, wherein step 5-4 comprises: denoting the depth characterization of the i-th pedestrian image in the query set as f_i^norm, i ∈ N_query, and the depth characterization of the j-th pedestrian image in the queried set as f_j^norm, j ∈ N_gallery, the distance matrix between the query set and the queried set is:

M_ji = 1 − f_i^norm · f_j^norm

wherein N_gallery denotes the queried set, N_query denotes the query set, and M_ji denotes the element in the i-th row and the j-th column of the matrix; the distances between each query image and the pedestrian images in the queried set are sorted in ascending order to obtain the identification result of each query image.
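The retrieval of claims 2 and 3 amounts to flip-augmented feature extraction, two-norm normalization, and a cosine-distance ranking; a short sketch follows, with a hypothetical `extract` function standing in for Φ(·; θ_T):

```python
# A sketch of the inference of steps 5-1 to 5-4 and claim 3.
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed(extract, imgs):
    """Steps 5-1 to 5-3: add the flipped-image features, then two-norm normalize."""
    feats = extract(imgs) + extract(torch.flip(imgs, dims=[3]))   # step 5-2
    return F.normalize(feats, p=2, dim=1)                         # step 5-3

def rank(query_f, gallery_f):
    """Step 5-4: cosine-distance matrix M and ascending ranking per query image."""
    M = 1.0 - query_f @ gallery_f.t()        # unit norm => cosine distance
    return M.argsort(dim=1)                  # nearest gallery images first
```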
CN201911362901.4A 2019-12-26 2019-12-26 Pedestrian image identification method based on depth network model Active CN111177447B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911362901.4A CN111177447B (en) 2019-12-26 2019-12-26 Pedestrian image identification method based on depth network model

Publications (2)

Publication Number Publication Date
CN111177447A CN111177447A (en) 2020-05-19
CN111177447B true CN111177447B (en) 2021-04-30

Family

ID=70655664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911362901.4A Active CN111177447B (en) 2019-12-26 2019-12-26 Pedestrian image identification method based on depth network model

Country Status (1)

Country Link
CN (1) CN111177447B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783526B (en) * 2020-05-21 2022-08-05 昆明理工大学 Cross-domain pedestrian re-identification method using posture invariance and graph structure alignment
CN111882548A (en) * 2020-07-31 2020-11-03 北京小白世纪网络科技有限公司 Method and device for counting cells in pathological image based on deep learning
CN112926569B (en) * 2021-03-16 2022-10-18 重庆邮电大学 Method for detecting natural scene image text in social network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101865A (en) * 2018-05-31 2018-12-28 湖北工业大学 A kind of recognition methods again of the pedestrian based on deep learning
CN110008842A (en) * 2019-03-09 2019-07-12 同济大学 A kind of pedestrian's recognition methods again for more losing Fusion Model based on depth
CN110046553A (en) * 2019-03-21 2019-07-23 华中科技大学 A kind of pedestrian weight identification model, method and system merging attributive character
CN110059616A (en) * 2019-04-17 2019-07-26 南京邮电大学 Pedestrian's weight identification model optimization method based on fusion loss function
CN110096947A (en) * 2019-03-15 2019-08-06 昆明理工大学 A kind of pedestrian based on deep learning recognizer again
CN110188611A (en) * 2019-04-26 2019-08-30 华中科技大学 A kind of pedestrian recognition methods and system again introducing visual attention mechanism
CN110472591A (en) * 2019-08-19 2019-11-19 浙江工业大学 It is a kind of that pedestrian's recognition methods again is blocked based on depth characteristic reconstruct


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jinlin Wu, Zhen Lei et al., "Clustering and Dynamic Sampling Based Unsupervised Domain Adaptation for Person Re-identification," 2019 IEEE International Conference on Multimedia and Expo, 2019-07-12, pp. 886-888 *
Guanshuo Wang et al., "Learning Discriminative Features with Multiple Granularities for Person Re-Identification," Proceedings of the 26th ACM International Conference on Multimedia, 2018-12-26, pp. 274-282 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant