CN111709313B - Pedestrian re-identification method based on local and channel combination characteristics - Google Patents

Info

Publication number: CN111709313B (granted patent; earlier application publication CN111709313A)
Application number: CN202010460902.9A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: pedestrian, picture, pictures, network
Inventors: 徐尔立, 翁立, 王建中
Assignee: Hangzhou Dianzi University (original and current assignee)
History: application filed by Hangzhou Dianzi University; CN111709313A published as the application, CN111709313B published as the granted patent
Legal status: Active (granted)

Classifications

    • G06V40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045: Combinations of networks (neural network architectures)
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/08: Learning methods (computing arrangements based on biological models)
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components


Abstract

The invention provides a pedestrian re-identification method based on local and channel combination features. Various occlusion situations are simulated through data enhancement, improving the robustness of the model to occlusion. A spatial transformer network (STN) simultaneously scales, rotates and translates the picture so that pedestrian pictures are aligned, after which the picture is divided horizontally to obtain features of different body parts. For the global feature of the whole pedestrian picture, a classification loss enforces correct classification of pedestrian identity, while a similarity loss pulls the features of the same pedestrian closer together and pushes the features of different pedestrians further apart. For the local and channel combination features, a similarity loss compares the different patterns appearing on the different body parts. Finally, the two kinds of features are fused into the pedestrian descriptor, further improving its discriminative power. By improving the occlusion robustness and discriminative power of the pedestrian descriptor, pedestrian re-identification becomes more accurate.

Description

Pedestrian re-identification method based on local and channel combination characteristics
Technical Field
The invention belongs to the fields of computer vision and image retrieval, and relates to a pedestrian re-identification method based on local and channel combination features. The method addresses several common problems in the field of pedestrian re-identification.
Background
With the development and popularization of surveillance systems, ever more pedestrian image data await processing. Pedestrian re-identification is the task of finding, among the pedestrian images captured by other cameras, the images that show the same pedestrian as an image captured by a given camera. It has wide application in real life, such as intelligent security, criminal investigation and human-computer interaction, and is closely related to other fields such as pedestrian detection and pedestrian tracking.
The pedestrian re-identification methods in common use are based on convolutional neural networks (CNNs). Some approaches therefore aim to design or refine network models to extract more discriminative pedestrian image features, for example a residual network ResNet-50 pre-trained on the ImageNet dataset and fine-tuned on pedestrian re-identification datasets. Other methods work on improving or designing the loss functions, which fall mainly into two categories: 1) classification losses, which treat each pedestrian as a particular class, such as cross-entropy loss; 2) similarity losses, which constrain the similarity relationships between pedestrian images, such as contrastive loss, triplet loss and quadruplet loss.
Disclosure of Invention
Aiming at the problems in the existing pedestrian re-identification field, the invention provides a pedestrian re-identification method based on local and channel combination features. The method has the following advantages: 1) the network model improves its resistance to occlusion through data enhancement; 2) the misalignment of pedestrian images is addressed by a spatial transformer network (STN); 3) more discriminative local and channel combination features are obtained by cutting the feature map and grouping its channels; 4) applying different loss functions to different features further improves their discriminative power. The method comprehensively addresses the main problems of occlusion, misalignment and large appearance variation in pedestrian re-identification, yielding more accurate recognition.
A pedestrian re-identification method based on local and channel combination features comprises the following procedures:
Firstly, the training process: the neural network is trained to obtain the best network parameters. A sample in the training dataset consists of a pedestrian picture x and its corresponding pedestrian identity ID(x), ID(x) ∈ {1, ..., C}, where C is the total number of pedestrian identities and each identity has several pictures. The specific steps are as follows:
Step 1, sampling samples in a training set to generate small-batch data:
A small batch of data contains P × K pictures, namely P pedestrians with different identities and K pictures per pedestrian. If a pedestrian has more than K pictures in the training set, K pictures are sampled at random; if fewer than K, all of its pictures are taken and the remainder is filled by repeated sampling.
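As an illustration, this P × K sampling could look like the following Python sketch (the `identity_to_pictures` mapping and the default values of P and K are assumptions for the example, not values fixed by the method):

```python
import random

def sample_pk_batch(identity_to_pictures, P=16, K=4):
    """Sample a small batch of P identities with K pictures each.

    If an identity has more than K pictures, K are drawn without
    replacement; otherwise all pictures are taken and the remainder
    is filled by sampling with replacement.
    """
    batch = []
    identities = random.sample(list(identity_to_pictures), P)
    for pid in identities:
        pics = identity_to_pictures[pid]
        if len(pics) >= K:
            chosen = random.sample(pics, K)
        else:
            chosen = list(pics) + random.choices(pics, k=K - len(pics))
        batch.extend((p, pid) for p in chosen)
    return batch  # P*K (picture, identity) pairs
```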
Step 2, improving the occlusion robustness of the model through data enhancement:
2-1. Generate a picture pool (Pool) that can store picture blocks of different resolutions;
2-2. Before each picture is input into the network, with probability p_1 a small picture block is copied from it into the Pool. Assuming the picture resolution is H × W, the resolution of the block falls randomly in the interval [0.1H, 0.2H] × [0.1W, 0.2W], and its location is also chosen at random.
2-3. Then, with probability p_2, a picture block is drawn at random from the Pool and pasted over the picture at a randomly chosen position.
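A minimal sketch of this pool-based occlusion augmentation, assuming pictures are given as H × W × 3 arrays; the probability values and the pool capacity are free parameters not fixed above:

```python
import random

class OcclusionPool:
    """Step 2 data enhancement: copy random patches into a pool and
    paste pool patches onto later pictures to simulate occlusion."""

    def __init__(self, p1=0.5, p2=0.5, max_size=1000):
        self.pool, self.p1, self.p2, self.max_size = [], p1, p2, max_size

    def __call__(self, img):  # img: H x W x 3 array
        h, w = img.shape[:2]
        # 2-2: with probability p1, store a random small block in the pool.
        if random.random() < self.p1 and len(self.pool) < self.max_size:
            ph = random.randint(int(0.1 * h), int(0.2 * h))
            pw = random.randint(int(0.1 * w), int(0.2 * w))
            y, x = random.randint(0, h - ph), random.randint(0, w - pw)
            self.pool.append(img[y:y + ph, x:x + pw].copy())
        # 2-3: with probability p2, paste a pooled block at a random position.
        if self.pool and random.random() < self.p2:
            patch = random.choice(self.pool)
            ph, pw = patch.shape[:2]
            if ph <= h and pw <= w:
                y, x = random.randint(0, h - ph), random.randint(0, w - pw)
                img = img.copy()
                img[y:y + ph, x:x + pw] = patch
        return img
```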
Step 3, loading a pre-trained network:
The ResNet-50 network pre-trained on the ImageNet dataset is used; the structure before its global average pooling (GAP) layer is retained and the stride of the last convolutional layer is set to 1. This is denoted the "convolutional base network". After a picture with resolution 256 × 128 is input into the convolutional base network, a tensor feature map T of size 16 × 8 × 2048 is output.
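In PyTorch, such a convolutional base network could be obtained roughly as follows (a sketch assuming torchvision's ResNet-50; note that PyTorch stores feature maps channel-first, so the 16 × 8 × 2048 map appears as a 2048 × 16 × 8 tensor):

```python
import torch
import torchvision

def build_conv_base():
    """ResNet-50 up to (but excluding) global average pooling, with the
    stride of the last stage set to 1 so that a 256x128 input yields a
    16x8 spatial map with 2048 channels."""
    # torchvision >= 0.13; older versions use pretrained=True instead.
    resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    # In torchvision, the stride-2 downsampling of the last stage sits in
    # the first block of layer4 (its 3x3 conv and its 1x1 downsample conv).
    resnet.layer4[0].conv2.stride = (1, 1)
    resnet.layer4[0].downsample[0].stride = (1, 1)
    return torch.nn.Sequential(
        resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
        resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4,
    )

# Quick shape check:
# x = torch.randn(1, 3, 256, 128); build_conv_base()(x).shape -> (1, 2048, 16, 8)
```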
Step 4, grouping the channels to obtain the features of each channel group:
The tensor feature map T of size 16 × 8 × 2048 obtained in step 3 is divided evenly into 4 groups along the channel (i.e. the last) dimension; each group is a tensor feature map of size 16 × 8 × 512, denoted T_1, T_2, T_3, T_4.
Step 5, cutting the tensor feature maps to obtain local features:
Each group tensor feature map T_1, T_2, T_3, T_4 obtained in step 4 is divided evenly into 4 local tensor feature maps along the horizontal direction; each local tensor feature map has size 4 × 8 × 512, and they are denoted T_11 ~ T_14, T_21 ~ T_24, T_31 ~ T_34, T_41 ~ T_44. Steps 4 and 5 thus yield 16 local tensor feature maps T_11 ~ T_14, T_21 ~ T_24, T_31 ~ T_34, T_41 ~ T_44. Each local tensor feature map represents the combined features of a particular location and a particular channel group.
Step 6, compressing the feature maps:
The tensor feature map T is convolved with 512 convolution kernels of size 16 × 8 × 512, with randomly initialized parameters, to obtain the global feature g of size 1 × 1 × 512. Likewise, T_11 ~ T_14, T_21 ~ T_24, T_31 ~ T_34, T_41 ~ T_44 are each convolved with 512 kernels of size 4 × 8 × 512, with randomly initialized parameters, yielding 16 local and channel combination features pc_1 ~ pc_16 of size 1 × 1 × 512.
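Steps 4-6 amount to splitting the map into 4 channel groups and 4 horizontal parts and compressing each piece with its own convolution. The sketch below is a hedged illustration in PyTorch (channel-first layout); kernel depths here follow the input channel counts, reading the 16 × 8 × 512 global kernel above as spanning the full 2048-channel input:

```python
import torch
import torch.nn as nn

class LocalChannelHead(nn.Module):
    """Compress the 16x8 feature map into one global feature g and
    16 local and channel combination features pc_1..pc_16."""

    def __init__(self, channels=2048, groups=4, parts=4, dim=512):
        super().__init__()
        self.groups, self.parts = groups, parts
        # Step 6: one conv collapses the whole map to a 512-d global feature.
        self.global_conv = nn.Conv2d(channels, dim, kernel_size=(16, 8))
        # One conv per (channel group, horizontal part) pair.
        self.local_convs = nn.ModuleList(
            nn.Conv2d(channels // groups, dim, kernel_size=(16 // parts, 8))
            for _ in range(groups * parts)
        )

    def forward(self, T):                            # T: (B, 2048, 16, 8)
        g = self.global_conv(T).flatten(1)           # (B, 512)
        pcs, idx = [], 0
        for Tg in T.chunk(self.groups, dim=1):       # step 4: channel groups
            for Tgp in Tg.chunk(self.parts, dim=2):  # step 5: horizontal strips
                pcs.append(self.local_convs[idx](Tgp).flatten(1))
                idx += 1
        return g, pcs                                # g: (B,512); 16 x (B,512)
```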
Step 7, applying different loss functions to different features:
For the local and channel combination features pc_1 ~ pc_16, the batch hard sample triplet loss (Batch Hard Triplet Loss) is applied to each feature separately:

L_{tri}(X;\theta) = \sum_{i=1}^{P}\sum_{a=1}^{K}\Big[ m + \max_{p=1,\dots,K} D\big(f_\theta(x_a^i), f_\theta(x_p^i)\big) - \min_{j=1,\dots,P,\, j\neq i;\; n=1,\dots,K} D\big(f_\theta(x_a^i), f_\theta(x_n^j)\big) \Big]_+ \qquad (1)

In formula (1), X denotes the small batch of data obtained by sampling in step 1 and θ denotes the parameters of the network. x_a^i denotes the a-th of the K pictures of the i-th pedestrian and x_p^i denotes the p-th of the K pictures of the i-th pedestrian; since the two pictures belong to the same pedestrian they are called a positive sample pair. x_n^j denotes the n-th of the K pictures of the j-th pedestrian; since x_a^i and x_n^j belong to different pedestrians they are called a negative sample pair. f_θ(x) denotes the feature output after picture x is passed through the network, and D(x, y) denotes the Euclidean distance between features x and y. m is a constant that constrains the relationship between the distances of the two feature pairs, and [x]_+ = max(0, x). For a pedestrian picture x_a^i, every picture x_p^i among the K pictures of that pedestrian is traversed to find the particular x_p^i whose feature is at maximum Euclidean distance from the feature of x_a^i after both are passed through the network; (x_a^i, x_p^i) is the hard positive sample pair. At the same time, every picture x_n^j of the remaining pedestrians ((P-1) × K pictures in total) is traversed to find the particular x_n^j whose feature is at minimum Euclidean distance from the feature of x_a^i; (x_a^i, x_n^j) is the hard negative sample pair. The loss function thus finds, for every picture of every pedestrian, the corresponding hard positive and hard negative sample pairs and constrains the relationship between the feature distance of the hard positive pair and that of the hard negative pair.
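A hedged PyTorch sketch of this batch hard triplet loss over one small batch (the margin value is an assumption, and the mean over anchors is used here purely for scale):

```python
import torch

def batch_hard_triplet_loss(feats, labels, margin=0.3):
    """feats: (P*K, d) features f_theta(x); labels: (P*K,) identity ids.
    For each anchor, take the hardest positive (largest distance) and the
    hardest negative (smallest distance) and apply a hinge with margin m."""
    dist = torch.cdist(feats, feats, p=2)               # Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)    # same-identity mask
    pos_dist = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    neg_dist = dist.masked_fill(same, float('inf')).min(dim=1).values
    return torch.clamp(margin + pos_dist - neg_dist, min=0).mean()
```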
For the feature pc_1, the Batch Hard Triplet Loss is:

L_{tri}^{pc_1}(X;\theta) = \sum_{i=1}^{P}\sum_{a=1}^{K}\Big[ m + \max_{p=1,\dots,K} D\big(pc_1(x_a^i), pc_1(x_p^i)\big) - \min_{j=1,\dots,P,\, j\neq i;\; n=1,\dots,K} D\big(pc_1(x_a^i), pc_1(x_n^j)\big) \Big]_+ \qquad (2)

In formula (2), pc_1(x_a^i) denotes the feature pc_1 extracted from the a-th picture of the i-th pedestrian, pc_1(x_p^i) the feature pc_1 extracted from the p-th picture of the i-th pedestrian, and pc_1(x_n^j) the feature pc_1 extracted from the n-th picture of the j-th pedestrian.
For the global feature g, the Batch Hard Triplet Loss and the Softmax Loss are applied respectively. The Batch Hard Triplet Loss is:

L_{tri}^{g}(X;\theta) = \sum_{i=1}^{P}\sum_{a=1}^{K}\Big[ m + \max_{p=1,\dots,K} D\big(g(x_a^i), g(x_p^i)\big) - \min_{j=1,\dots,P,\, j\neq i;\; n=1,\dots,K} D\big(g(x_a^i), g(x_n^j)\big) \Big]_+ \qquad (3)

In formula (3), g(x_a^i) denotes the feature g extracted from the a-th picture of the i-th pedestrian, g(x_p^i) the feature g extracted from the p-th picture of the i-th pedestrian, and g(x_n^j) the feature g extracted from the n-th picture of the j-th pedestrian. Before the Softmax Loss is applied, g is fed into a fully connected layer (FC layer). The number of output neurons of the fully connected layer equals the total number C of pedestrian identities in the training set, and its parameters are initialized randomly. The Softmax Loss of the global feature g is:

L_{softmax}^{g}(X;\theta) = -\sum_{i=1}^{P}\sum_{a=1}^{K} \log \frac{\exp\big(W_{y_a^i}^{\top}\, g(x_a^i)\big)}{\sum_{k=1}^{C} \exp\big(W_k^{\top}\, g(x_a^i)\big)} \qquad (4)

In formula (4), g(x_a^i) denotes the feature g extracted from the a-th picture of the i-th pedestrian, y_a^i denotes the pedestrian identity corresponding to that picture, W_{y_a^i} denotes the weight of the y_a^i-th output neuron of the FC layer, and W_k denotes the weight of the k-th output neuron of the FC layer.
The overall loss function of the network is:

Loss = \lambda_1 \sum_{s=1}^{16} L_{tri}^{pc_s} + \lambda_2 L_{tri}^{g} + \lambda_3 L_{softmax}^{g} \qquad (5)

In formula (5), λ_1, λ_2, λ_3 are the weights of the three losses and satisfy λ_1 + λ_2 + λ_3 = 1.
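Combining the pieces, the overall objective of formula (5) might be computed as follows (a sketch reusing the `batch_hard_triplet_loss` defined above; the λ values are placeholders satisfying λ1 + λ2 + λ3 = 1, and the Softmax loss is realized as cross-entropy over the FC-layer outputs):

```python
import torch.nn.functional as F

def total_loss(g, pcs, logits, labels, lam=(0.4, 0.3, 0.3), margin=0.3):
    """g: (B,512) global features; pcs: list of 16 (B,512) local/channel
    features; logits: (B,C) output of the FC layer applied to g.
    batch_hard_triplet_loss is the sketch given earlier."""
    l_pc = sum(batch_hard_triplet_loss(pc, labels, margin) for pc in pcs)
    l_g = batch_hard_triplet_loss(g, labels, margin)
    l_softmax = F.cross_entropy(logits, labels)    # Softmax loss on g
    lam1, lam2, lam3 = lam
    return lam1 * l_pc + lam2 * l_g + lam3 * l_softmax
```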
Step 8: denote the network constructed in steps 3-6 as N. Using a gradient descent algorithm, the loss function Loss in step 7 is differentiated and the learnable parameters in N are optimized by back-propagation.
Step 9, aligning the feature map using a spatial transformer network:
9-1. The feature map F_4 (a three-dimensional tensor) output by the 4th block (Res 4 Block) of the convolutional base network in N is passed through a residual connection block (Res Block, parameters randomly initialized) and a GAP layer to obtain a vector θ = (θ_11, θ_12, θ_13, θ_21, θ_22, θ_23) of length 6, where θ_11, θ_12, θ_21, θ_22 scale and rotate the feature map and θ_13, θ_23 translate it.
9-2. Using θ_11, θ_12, θ_13, θ_21, θ_22, θ_23, an affine transformation is applied to the feature map F_2 (a tensor of size H × W × C) output by the 2nd block (Res 2 Block) of the convolutional base network in N, producing a blank feature map F''_2. For the feature map of channel c of F_2 (a tensor of size H × W), a pixel located at coordinate (x_s, y_s) is moved to (x_t, y_t) by the affine transformation; the relationship between the two is:

\begin{pmatrix} x_t \\ y_t \end{pmatrix} = \begin{pmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{pmatrix} \begin{pmatrix} x_s \\ y_s \\ 1 \end{pmatrix} \qquad (6)

9-3. According to formula (6), the blank feature map F''_2 is filled by sampling pixels from F_2, giving the aligned feature map F'_2. During the affine process, if the F_2 coordinate corresponding to a coordinate of F''_2 falls outside the original range of F_2, the pixel value at that coordinate is set to 0. If the F_2 coordinate corresponding to a coordinate of F''_2 does not fall exactly on a pixel, the pixel value is filled by bilinear interpolation:

F_2'^{\,c}(m,n) = \sum_{h=1}^{H}\sum_{w=1}^{W} F_2^{c}(h,w)\,\max\big(0, 1-|x_s-w|\big)\,\max\big(0, 1-|y_s-h|\big) \qquad (7)

In formula (7), F_2'^{\,c}(m,n) is the pixel value at position (m, n) on channel c of F'_2, F_2^{c}(h,w) is the pixel value at position (h, w) on channel c of F_2, and (x_s, y_s) is the F_2 coordinate corresponding to position (m, n) of F''_2.
Step 10, processing the aligned feature map:
The aligned feature map F'_2 is input into a new convolutional network formed by stacking the Res 3 Block, Res 4 Block and Res 5 Block of a ResNet-50 network pre-trained on the ImageNet dataset, and it outputs a feature map T_align of the same size as the feature map T in step 3. The same operations as in steps 3-6 are applied to T_align, yielding 1 global feature g_align and 16 local and channel combination features pc_1^align ~ pc_16^align. The network constructed in steps 9-10 is denoted N_align; N_align consists of the Res 1 Block, Res 2 Block, Res 3 Block and Res 4 Block of the convolutional base network in N together with the STN, the Res 3 Block, Res 4 Block and Res 5 Block of the new convolutional network, and the convolutional layers that compress the global feature and the local and channel combination features. For the global feature g_align and the local and channel combination features pc_1^align ~ pc_16^align, the same loss function as in step 7 is used to optimize the learnable parameters of N_align.
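A structural sketch of N_align, under the assumption that the pieces above are available as modules (the stage splits, `FeatureMapSTN` and `LocalChannelHead` refer to the earlier sketches, not to the authors' implementation):

```python
import torch.nn as nn

class AlignedNetwork(nn.Module):
    """Sketch of N_align: the first two residual stages of the trained base
    network, an STN that aligns the stage-2 feature map using the stage-4
    output, a freshly initialised ResNet-50 stage 3-5 stack, and the same
    global / local-channel compression head."""

    def __init__(self, stem, res2, res3, res4, stn, new_res3_5, head):
        super().__init__()
        self.stem, self.res2 = stem, res2        # Res 1 / Res 2 of network N
        self.res3, self.res4 = res3, res4        # used to drive the STN
        self.stn, self.new_res3_5, self.head = stn, new_res3_5, head

    def forward(self, x):
        f2 = self.res2(self.stem(x))             # feature map F_2
        f4 = self.res4(self.res3(f2))            # feature map F_4
        f2_aligned = self.stn(f2, f4)            # steps 9-2 / 9-3
        t_align = self.new_res3_5(f2_aligned)    # step 10
        return self.head(t_align)                # g_align and pc_align_1..16
```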
Secondly, the test process:
The test dataset is divided into a query set and a warehouse set (the gallery). The query set contains pedestrian pictures with known identities, and the warehouse set contains pictures of the same identities as the query pedestrians together with pictures of different identities. The dataset is constructed by first capturing pedestrian pictures with surveillance cameras whose fields of view do not overlap, then automatically marking pedestrian bounding boxes with a pedestrian detector (DPM), and finally keeping the pedestrian pictures inside the boxes and attaching pedestrian identity labels; pictures of the same pedestrian in the query set and the warehouse set come from different camera views. The specific steps are as follows:
Step 1. Input the pedestrian picture to be queried into N_align, and concatenate the output g_align and pc_1^align ~ pc_16^align to obtain the pedestrian descriptor, an 8704-dimensional feature vector.
Step 2. Obtain the pedestrian descriptors of all the pictures in the warehouse set in the same way as step 1.
Step 3. Compute and store the cosine distance between the descriptor of the pedestrian to be queried and each pedestrian descriptor in the warehouse set.
Step 4. Sort the stored distances from small to large and take the warehouse-set pedestrian pictures corresponding to the first k distances as the re-identification result for the pedestrian to be queried.
Step 5. Measure the recognition performance of the model by checking whether the true identities of the retrieved warehouse-set pedestrian pictures match the identity of the pedestrian to be queried.
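Steps 3 and 4 above amount to a cosine-distance nearest-neighbour search over the warehouse-set descriptors; a small NumPy sketch (descriptor extraction is assumed to follow step 1):

```python
import numpy as np

def rank_gallery(query_desc, gallery_descs, k=10):
    """query_desc: (8704,) descriptor of the query pedestrian;
    gallery_descs: (N, 8704) descriptors of the warehouse-set pictures.
    Returns indices of the k pictures with smallest cosine distance."""
    q = query_desc / np.linalg.norm(query_desc)
    g = gallery_descs / np.linalg.norm(gallery_descs, axis=1, keepdims=True)
    cos_dist = 1.0 - g @ q                  # cosine distance per picture
    return np.argsort(cos_dist)[:k]         # step 4: sort small to large
```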
The invention has the following beneficial effects:
Various occlusion situations are simulated through data enhancement, and by processing these artificially occluded pictures the network becomes more robust to occlusion. The STN simultaneously scales, rotates and translates the picture so that pedestrian pictures are aligned. Once the pictures are aligned, a simple horizontal division is enough to localize the different body parts of a pedestrian and obtain features of the different parts (cutting, channel grouping and affine transformation on the feature map are equivalent to the same operations on the original picture). Different channels of the feature map respond to different patterns (color, clothing type, gender, age, etc.), so the local and channel combination features can better localize the different patterns on the different body parts of a pedestrian. For the global feature of the whole pedestrian picture, a classification loss enforces correct classification of pedestrian identity, while a similarity loss pulls the features of the same pedestrian closer together and pushes the features of different pedestrians further apart. For the local and channel combination features, a classification loss is not suitable because they carry little information and cannot by themselves classify pedestrian identity correctly; instead, comparing the different patterns on the different body parts through a similarity loss lets the model distinguish these patterns better, making the local and channel combination features more discriminative. Finally, the two kinds of features are fused into the pedestrian descriptor, further improving its discriminative power. By improving the occlusion robustness and discriminative power of the pedestrian descriptor, pedestrian re-identification becomes more accurate.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is an exemplary diagram of data enhancement;
FIG. 3 shows the network constructed in steps 3-6 of the training process;
FIG. 4 shows the network constructed in steps 9-10 of the training process;
FIG. 5 is a flow chart of the test of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The training flow of the pedestrian re-identification method based on local and channel combination features is shown in FIG. 1. Data enhancement is first applied to a batch of training samples; the enhanced samples are input into the convolutional base network, which outputs a feature map. Two different operations are performed on the feature map: the first compresses the feature map to obtain a global feature; the second groups the channels and cuts horizontally to produce sub-feature maps, which are then compressed to obtain the local and channel combination features. Different loss functions are applied to the global feature and to the local and channel combination features; the total loss function is differentiated and the network is optimized by back-propagation. The feature map output by the Res 2 Block of the optimized network is then aligned by the STN and fed into a new convolutional network to obtain an output feature map. The aligned global feature and local and channel combination features are obtained from this feature map in the same manner as above, and the same loss functions are applied to optimize the new network again.
The method comprises the following specific steps:
step 1, sampling samples in a training set to generate small-batch data:
A small batch of data contains P × K pictures, namely P pedestrians with different identities and K pictures per pedestrian. If a pedestrian has more than K pictures in the training set, K pictures are sampled at random; if fewer than K, all of its pictures are taken and the remainder is filled by repeated sampling.
Step 2, improving the occlusion robustness of the model through the data enhancement shown in FIG. 2:
2-1. Generate a picture pool (Pool) that can store picture blocks of different resolutions;
2-2. Before each picture is input into the network, with probability p_1 a small picture block is copied from it into the Pool. Assuming the picture resolution is H × W, the resolution of the block falls randomly in the interval [0.1H, 0.2H] × [0.1W, 0.2W], and its location is also chosen at random.
2-3. Then, with probability p_2, a picture block is drawn at random from the Pool and pasted over the picture at a randomly chosen position.
Step 3, loading a pre-trained network:
The ResNet-50 network pre-trained on the ImageNet dataset is used; the structure before its global average pooling (GAP) layer is retained and the stride of the last convolutional layer is set to 1. This is denoted the convolutional base network. A picture with resolution 256 × 128 is input into the convolutional base network, which outputs a tensor feature map T of size 16 × 8 × 2048.
Step 4, grouping the channels to obtain the features of each channel group:
The tensor feature map T of size 16 × 8 × 2048 obtained in step 3 is divided evenly into 4 groups along the channel dimension; each group is a tensor feature map of size 16 × 8 × 512, denoted T_1, T_2, T_3, T_4.
Step 5, cutting the tensor feature maps to obtain local features:
Each group tensor feature map T_1, T_2, T_3, T_4 obtained in step 4 is divided evenly into 4 local tensor feature maps along the horizontal direction; each local tensor feature map has size 4 × 8 × 512, and they are denoted T_11 ~ T_14, T_21 ~ T_24, T_31 ~ T_34, T_41 ~ T_44. Steps 4 and 5 thus yield 16 local tensor feature maps T_11 ~ T_14, T_21 ~ T_24, T_31 ~ T_34, T_41 ~ T_44. Each local tensor feature map represents the combined features of a particular location and a particular channel group.
Step 6, compressing the feature maps:
The tensor feature map T is convolved with 512 convolution kernels of size 16 × 8 × 512, with randomly initialized parameters, to obtain the global feature g of size 1 × 1 × 512. Likewise, T_11 ~ T_14, T_21 ~ T_24, T_31 ~ T_34, T_41 ~ T_44 are each convolved with 512 kernels of size 4 × 8 × 512, with randomly initialized parameters, yielding 16 local and channel combination features pc_1 ~ pc_16 of size 1 × 1 × 512. The network N constructed in steps 3-6 is shown in FIG. 3.
Step 7, applying different loss functions to different features:
For the local and channel combination features pc_1 ~ pc_16, the batch hard sample triplet loss (Batch Hard Triplet Loss) is applied to each feature separately.
For example, for the feature pc_1, the Batch Hard Triplet Loss is:

L_{tri}^{pc_1}(X;\theta) = \sum_{i=1}^{P}\sum_{a=1}^{K}\Big[ m + \max_{p=1,\dots,K} D\big(pc_1(x_a^i), pc_1(x_p^i)\big) - \min_{j=1,\dots,P,\, j\neq i;\; n=1,\dots,K} D\big(pc_1(x_a^i), pc_1(x_n^j)\big) \Big]_+ \qquad (8)

For the global feature g, the Batch Hard Triplet Loss and the Softmax Loss are applied respectively. The Batch Hard Triplet Loss is:

L_{tri}^{g}(X;\theta) = \sum_{i=1}^{P}\sum_{a=1}^{K}\Big[ m + \max_{p=1,\dots,K} D\big(g(x_a^i), g(x_p^i)\big) - \min_{j=1,\dots,P,\, j\neq i;\; n=1,\dots,K} D\big(g(x_a^i), g(x_n^j)\big) \Big]_+ \qquad (9)

The Softmax Loss is:

L_{softmax}^{g}(X;\theta) = -\sum_{i=1}^{P}\sum_{a=1}^{K} \log \frac{\exp\big(W_{y_a^i}^{\top}\, g(x_a^i)\big)}{\sum_{k=1}^{C} \exp\big(W_k^{\top}\, g(x_a^i)\big)} \qquad (10)

The overall loss function of the network is:

Loss = \lambda_1 \sum_{s=1}^{16} L_{tri}^{pc_s} + \lambda_2 L_{tri}^{g} + \lambda_3 L_{softmax}^{g} \qquad (11)
and 8, using a gradient descent algorithm to derive and reversely propagate the Loss function Loss in the step 7 to optimize learnable parameters in the N.
Step 9, aligning the feature map using a spatial transformer network:
9-1. The feature map F_4 (a three-dimensional tensor) output by the 4th block (Res 4 Block) of the convolutional base network in N is passed through a residual connection block (Res Block, parameters randomly initialized) and a GAP layer to obtain a vector θ = (θ_11, θ_12, θ_13, θ_21, θ_22, θ_23) of length 6, where θ_11, θ_12, θ_21, θ_22 scale and rotate the feature map and θ_13, θ_23 translate it.
9-2. Using θ_11, θ_12, θ_13, θ_21, θ_22, θ_23, an affine transformation is applied to the feature map F_2 (a tensor of size H × W × C) output by the 2nd block (Res 2 Block) of the convolutional base network in N, producing a blank feature map F''_2. For the feature map of channel c of F_2 (a tensor of size H × W), a pixel located at coordinate (x_s, y_s) is moved to (x_t, y_t) by the affine transformation; the relationship between the two is:

\begin{pmatrix} x_t \\ y_t \end{pmatrix} = \begin{pmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{pmatrix} \begin{pmatrix} x_s \\ y_s \\ 1 \end{pmatrix} \qquad (12)

9-3. According to formula (12), the blank feature map F''_2 is filled by sampling pixels from F_2, giving the aligned feature map F'_2. During the affine process, if the F_2 coordinate corresponding to a coordinate of F''_2 falls outside the original range of F_2, the pixel value at that coordinate is set to 0. If the F_2 coordinate corresponding to a coordinate of F''_2 does not fall exactly on a pixel, the pixel value is filled by bilinear interpolation:

F_2'^{\,c}(m,n) = \sum_{h=1}^{H}\sum_{w=1}^{W} F_2^{c}(h,w)\,\max\big(0, 1-|x_s-w|\big)\,\max\big(0, 1-|y_s-h|\big) \qquad (13)
step 10, processing the aligned characteristic graph:
for the aligned feature map F " 2 Inputting the data into a new convolutional network, wherein the new network is formed by stacking Res 3 Block, Res 4 Block and Res 5 Block in ResNet-50 network pre-trained on ImageNet data set, and outputting a feature map T with the same size as the feature map T in the step 3 align . For T align The same operations as in steps 3-6 were carried out to obtain 1 piece of the sameGlobal feature g align And 16 local and channel combination features
Figure BDA0002510875020000112
Note that the network constructed in steps 9-10 is N align ,N align The convolutional code is composed of Res 1 Block, Res 2 Block, Res 3 Block, Res 4 Block and STN of a convolutional base network in N, Res 3 Block, Res 4 Block and Res 5 Block in a new convolutional network, and convolutional layers for compressing global features and local and channel combination features, and the specific structure is shown in FIG. 4. For global feature g align And local and channel combining features
Figure BDA0002510875020000113
Optimizing N using the same loss function in step 7 align Of the parameter(s) to be learned.
The testing flow of the pedestrian re-identification method based on local and channel combination features is shown in FIG. 5. The pedestrian picture to be queried and all warehouse-set pedestrian pictures are input into the trained network, which outputs a pedestrian descriptor for each. The cosine distances between the pedestrian descriptors are computed, and the k warehouse-set pedestrian pictures with the smallest distances are taken as the re-identification result for the picture to be queried. Whether the identities of the re-identified pedestrians match the identity of the pedestrian to be queried is used to judge the quality of the model.
The method comprises the following specific steps:
Step 1. Input the pedestrian picture to be queried into N_align, and concatenate the output g_align and pc_1^align ~ pc_16^align to obtain the pedestrian descriptor, an 8704-dimensional feature vector.
Step 2. Obtain the pedestrian descriptors of all the pictures in the warehouse set in the same way as step 1.
Step 3. Compute and store the cosine distance between the descriptor of the pedestrian to be queried and each pedestrian descriptor in the warehouse set.
Step 4. Sort the stored distances from small to large and take the warehouse-set pedestrian pictures corresponding to the first k distances as the re-identification result for the pedestrian to be queried.

Claims (2)

1. A pedestrian re-identification method based on local and channel combination features, characterized by comprising the following procedures:
firstly, a training process: a neural network is trained to obtain the optimal network parameters; a sample in the training dataset consists of a pedestrian picture x and its corresponding pedestrian identity ID(x), ID(x) ∈ {1, ..., C}; C denotes the total number of pedestrian identities, and each identity has several pictures;
secondly, a test process:
the test dataset is divided into a query set and a warehouse set, the query set containing pedestrian pictures with known identities and the warehouse set containing pictures of the same identities as the query pedestrians together with pictures of different identities; the dataset is constructed by first capturing pedestrian pictures with surveillance cameras whose fields of view do not overlap, then automatically marking pedestrian bounding boxes with a pedestrian detector (DPM), and finally keeping the pedestrian pictures inside the boxes and attaching pedestrian identity labels, pictures of the same pedestrian in the query set and the warehouse set having different camera views;
the training process comprises the following specific steps:
step 1, sampling samples in a training set to generate small-batch data:
the small batch of data comprises P × K pictures, namely P pedestrians with different identities, each pedestrian having K pictures; if a pedestrian has more than K pictures in the training set, K pictures are sampled at random; if fewer than K, all of its pictures are taken and the remainder is filled by repeated sampling;
Step 2, improving the anti-shielding capability of the model in a data enhancement mode:
2-1, generating a picture pool (Pool) capable of storing picture blocks of different resolutions;
2-2, before each picture is input into the network, copying with probability p_1 a small picture block from it into the Pool; assuming the picture resolution is H × W, the resolution of the block falls randomly in the interval [0.1H, 0.2H] × [0.1W, 0.2W], and its location is also chosen at random;
2-3, then, with probability p_2, randomly selecting a picture block from the Pool and pasting it over the picture at a randomly chosen position;
step 3, loading a pre-training network:
using the ResNet-50 network pre-trained on the ImageNet dataset, preserving the structure of the network before the global average pooling (GAP) layer and setting the stride of the last convolutional layer to 1, which is denoted the "convolutional base network"; inputting a picture with resolution 256 × 128 into the convolutional base network, which outputs a tensor feature map T of size 16 × 8 × 2048;
and 4, grouping the channels to obtain the characteristics of each group of channels:
dividing the tensor feature map T of size 16 × 8 × 2048 obtained in step 3 evenly into 4 groups along the channel, i.e. the last, dimension, each group being a tensor feature map of size 16 × 8 × 512, denoted T_1, T_2, T_3, T_4;
Step 5, cutting the tensor characteristic diagram to obtain local characteristics:
each group tensor feature map T_1, T_2, T_3, T_4 obtained in step 4 is divided evenly into 4 local tensor feature maps along the horizontal direction, each local tensor feature map having size 4 × 8 × 512, denoted T_11 ~ T_14, T_21 ~ T_24, T_31 ~ T_34, T_41 ~ T_44; steps 4 and 5 thus yield 16 local tensor feature maps T_11 ~ T_14, T_21 ~ T_24, T_31 ~ T_34, T_41 ~ T_44; each local tensor feature map represents the combined features of a particular position and a particular channel group;
and 6, compressing the characteristic diagram:
convolving the tensor feature map T with 512 convolution kernels of size 16 × 8 × 512, with randomly initialized parameters, to obtain the global feature g of size 1 × 1 × 512; likewise, convolving each of T_11 ~ T_14, T_21 ~ T_24, T_31 ~ T_34, T_41 ~ T_44 with 512 kernels of size 4 × 8 × 512, with randomly initialized parameters, yielding 16 local and channel combination features pc_1 ~ pc_16 of size 1 × 1 × 512;
Step 7, applying different loss functions to different characteristics:
for the local and channel combination features pc_1 ~ pc_16, applying the batch hard sample triplet loss (Batch Hard Triplet Loss) to each feature separately:
L_{tri}(X;\theta) = \sum_{i=1}^{P}\sum_{a=1}^{K}\Big[ m + \max_{p=1,\dots,K} D\big(f_\theta(x_a^i), f_\theta(x_p^i)\big) - \min_{j=1,\dots,P,\, j\neq i;\; n=1,\dots,K} D\big(f_\theta(x_a^i), f_\theta(x_n^j)\big) \Big]_+ \qquad (1)

in formula (1), X represents the small batch of data obtained by sampling in step 1 and θ represents the parameters of the network; x_a^i represents the a-th of the K pictures of the i-th pedestrian, x_p^i represents the p-th of the K pictures of the i-th pedestrian, and since the two pictures belong to the same pedestrian they are called a positive sample pair; x_n^j represents the n-th of the K pictures of the j-th pedestrian, and since x_a^i and x_n^j belong to different pedestrians they are called a negative sample pair; f_θ(x) represents the feature output after picture x is input into the network, and D(x, y) represents the Euclidean distance between features x and y; m is a constant that constrains the relationship between the distances of the two feature pairs, and [x]_+ = max(0, x); for a pedestrian picture x_a^i, each of the K pictures x_p^i of that pedestrian is traversed to find the particular x_p^i whose feature is at maximum Euclidean distance from the feature of x_a^i after both are input into the network, and (x_a^i, x_p^i) is the hard positive sample pair; at the same time, each picture x_n^j of the remaining pedestrians, (P-1) × K pictures in total, is traversed to find the particular x_n^j whose feature is at minimum Euclidean distance from the feature of x_a^i, and (x_a^i, x_n^j) is the hard negative sample pair; the loss function thus finds, for each picture of each pedestrian, the corresponding hard positive and hard negative sample pairs and constrains the relationship between the feature distance of the hard positive sample pair and that of the hard negative sample pair;
for the feature pc_1, the Batch Hard Triplet Loss is:
L_{tri}^{pc_1}(X;\theta) = \sum_{i=1}^{P}\sum_{a=1}^{K}\Big[ m + \max_{p=1,\dots,K} D\big(pc_1(x_a^i), pc_1(x_p^i)\big) - \min_{j=1,\dots,P,\, j\neq i;\; n=1,\dots,K} D\big(pc_1(x_a^i), pc_1(x_n^j)\big) \Big]_+ \qquad (2)

in formula (2), pc_1(x_a^i) represents the feature pc_1 extracted from the a-th picture of the i-th pedestrian, pc_1(x_p^i) the feature pc_1 extracted from the p-th picture of the i-th pedestrian, and pc_1(x_n^j) the feature pc_1 extracted from the n-th picture of the j-th pedestrian;
for the global feature g, the Batch Hard Triplet Loss and the Softmax Loss are respectively applied; the Batch Hard Triplet Loss is:
L_{tri}^{g}(X;\theta) = \sum_{i=1}^{P}\sum_{a=1}^{K}\Big[ m + \max_{p=1,\dots,K} D\big(g(x_a^i), g(x_p^i)\big) - \min_{j=1,\dots,P,\, j\neq i;\; n=1,\dots,K} D\big(g(x_a^i), g(x_n^j)\big) \Big]_+ \qquad (3)

in formula (3), g(x_a^i) represents the feature g extracted from the a-th picture of the i-th pedestrian, g(x_p^i) the feature g extracted from the p-th picture of the i-th pedestrian, and g(x_n^j) the feature g extracted from the n-th picture of the j-th pedestrian; before the Softmax Loss is applied, g is input into a fully connected layer (FC layer); the number of output neurons of the fully connected layer equals the total number C of pedestrian identities in the training set, and its parameters are initialized randomly; the Softmax Loss of the global feature g is:

L_{softmax}^{g}(X;\theta) = -\sum_{i=1}^{P}\sum_{a=1}^{K} \log \frac{\exp\big(W_{y_a^i}^{\top}\, g(x_a^i)\big)}{\sum_{k=1}^{C} \exp\big(W_k^{\top}\, g(x_a^i)\big)} \qquad (4)

in formula (4), g(x_a^i) represents the feature g extracted from the a-th picture of the i-th pedestrian, y_a^i represents the pedestrian identity corresponding to that picture, W_{y_a^i} represents the weight of the y_a^i-th output neuron of the FC layer, and W_k represents the weight of the k-th output neuron of the FC layer;
the overall loss function of the network is:
Loss = \lambda_1 \sum_{s=1}^{16} L_{tri}^{pc_s} + \lambda_2 L_{tri}^{g} + \lambda_3 L_{softmax}^{g} \qquad (5)

in formula (5), λ_1, λ_2, λ_3 are the weights of the three losses and satisfy λ_1 + λ_2 + λ_3 = 1;
Step 8, recording the network constructed in the step 3-6 as N; using a gradient descent algorithm, deriving the Loss function Loss in the step 7 and optimizing learnable parameters in the N through back propagation;
step 9, aligning the feature graph by using a space transformation network:
9-1, passing the feature map F_4 output by the 4th block, Res 4 Block, of the convolutional base network in N through a residual connection block and a GAP layer to obtain a vector θ = (θ_11, θ_12, θ_13, θ_21, θ_22, θ_23) of length 6, wherein θ_11, θ_12, θ_21, θ_22 are used to scale and rotate the feature map and θ_13, θ_23 are used to translate it;
9-2, using θ_11, θ_12, θ_13, θ_21, θ_22, θ_23 to apply an affine transformation to the feature map F_2 output by the 2nd block, Res 2 Block, of the convolutional base network in N, obtaining a blank feature map F''_2; for the feature map of channel c of F_2, a pixel located at coordinate (x_s, y_s) is moved to (x_t, y_t) by the affine transformation, and the relationship between the two is:
\begin{pmatrix} x_t \\ y_t \end{pmatrix} = \begin{pmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{pmatrix} \begin{pmatrix} x_s \\ y_s \\ 1 \end{pmatrix} \qquad (6)
9-3, according to formula (6), filling the blank feature map F''_2 by sampling pixels from F_2 to obtain the aligned feature map F'_2; during the affine process, if the F_2 coordinate corresponding to a coordinate of F''_2 falls outside the original range of F_2, the pixel values of those coordinates are set to 0; if the F_2 coordinate corresponding to a coordinate of F''_2 does not fall exactly on a pixel, the pixel value is filled by bilinear interpolation:
F_2'^{\,c}(m,n) = \sum_{h=1}^{H}\sum_{w=1}^{W} F_2^{c}(h,w)\,\max\big(0, 1-|x_s-w|\big)\,\max\big(0, 1-|y_s-h|\big) \qquad (7)

in formula (7), F_2'^{\,c}(m,n) is the pixel value at position (m, n) on channel c of F'_2, F_2^{c}(h,w) is the pixel value at position (h, w) on channel c of F_2, and (x_s, y_s) is the F_2 coordinate corresponding to position (m, n) of F''_2;
step 10, processing the aligned characteristic graph:
for the aligned feature map F'_2, inputting it into a new convolutional network, wherein the new convolutional network is formed by stacking the Res 3 Block, Res 4 Block and Res 5 Block of a ResNet-50 network pre-trained on the ImageNet dataset, and outputting a feature map T_align of the same size as the feature map T in step 3; performing the same operations as in steps 3-6 on T_align to obtain 1 global feature g_align and 16 local and channel combination features
pc_1^align ~ pc_16^align; the network constructed in steps 9-10 is denoted N_align, N_align consisting of the Res 1 Block, Res 2 Block, Res 3 Block and Res 4 Block of the convolutional base network in N together with the STN, the Res 3 Block, Res 4 Block and Res 5 Block of the new convolutional network, and the convolutional layers compressing the global feature and the local and channel combination features; for the global feature g_align and the local and channel combination features pc_1^align ~ pc_16^align, optimizing the learnable parameters of N_align using the same loss function as in step 7.
2. The pedestrian re-identification method based on local and channel combination features according to claim 1, characterized in that the test process comprises the following steps:
step 1, inputting the pedestrian picture to be queried into N_align, and concatenating the output g_align and pc_1^align ~ pc_16^align to obtain the pedestrian descriptor, an 8704-dimensional feature vector;
step 2, obtaining the pedestrian descriptors of all the pictures in the warehouse set through step 1;
step 3, respectively computing and storing the cosine distance between the descriptor of the pedestrian to be queried and each pedestrian descriptor in the warehouse set;
step 4, sorting the stored distances from small to large, and taking the warehouse-set pedestrian pictures corresponding to the first k distances as the re-identification result for the pedestrian to be queried;
step 5, measuring the recognition performance of the model by judging whether the true identities of the re-identified warehouse-set pedestrian pictures are consistent with the identity of the pedestrian to be queried.
CN202010460902.9A 2020-05-27 2020-05-27 Pedestrian re-identification method based on local and channel combination characteristics Active CN111709313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010460902.9A CN111709313B (en) 2020-05-27 2020-05-27 Pedestrian re-identification method based on local and channel combination characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010460902.9A CN111709313B (en) 2020-05-27 2020-05-27 Pedestrian re-identification method based on local and channel combination characteristics

Publications (2)

Publication Number Publication Date
CN111709313A CN111709313A (en) 2020-09-25
CN111709313B (en) 2022-07-29

Family

ID=72537979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010460902.9A Active CN111709313B (en) 2020-05-27 2020-05-27 Pedestrian re-identification method based on local and channel combination characteristics

Country Status (1)

Country Link
CN (1) CN111709313B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112232194A (en) * 2020-10-15 2021-01-15 广州云从凯风科技有限公司 Single-target human body key point detection method, system, equipment and medium
CN112686176B (en) * 2020-12-30 2024-05-07 深圳云天励飞技术股份有限公司 Target re-identification method, model training method, device, equipment and storage medium
CN113343909B (en) * 2021-06-29 2023-09-26 南京星云数字技术有限公司 Training method of multi-task classification network and pedestrian re-recognition method
CN113255615B (en) * 2021-07-06 2021-09-28 南京视察者智能科技有限公司 Pedestrian retrieval method and device for self-supervision learning
CN114170516B (en) * 2021-12-09 2022-09-13 清华大学 Vehicle weight recognition method and device based on roadside perception and electronic equipment


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009528A (en) * 2017-12-26 2018-05-08 广州广电运通金融电子股份有限公司 Face authentication method, device, computer equipment and storage medium based on Triplet Loss
CN108399362A (en) * 2018-01-24 2018-08-14 中山大学 A kind of rapid pedestrian detection method and device
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109583379A (en) * 2018-11-30 2019-04-05 常州大学 A kind of pedestrian's recognition methods again being aligned network based on selective erasing pedestrian
CN110543817A (en) * 2019-07-25 2019-12-06 北京大学 Pedestrian re-identification method based on posture guidance feature learning
CN110659573A (en) * 2019-08-22 2020-01-07 北京捷通华声科技股份有限公司 Face recognition method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Combining multilevel feature extraction and multi-loss learning for person re-identification; Weilin Zhong et al.; Neurocomputing; 2019-12-31; full text *

Also Published As

Publication number Publication date
CN111709313A (en) 2020-09-25

Similar Documents

Publication Publication Date Title
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
Xie et al. Multilevel cloud detection in remote sensing images based on deep learning
CN110348319B (en) Face anti-counterfeiting method based on face depth information and edge image fusion
CN106599883B (en) CNN-based multilayer image semantic face recognition method
CN112766158B (en) Multi-task cascading type face shielding expression recognition method
CN107066559B (en) Three-dimensional model retrieval method based on deep learning
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
CN110209859B (en) Method and device for recognizing places and training models of places and electronic equipment
CN109145745B (en) Face recognition method under shielding condition
CN108154133B (en) Face portrait-photo recognition method based on asymmetric joint learning
Yang et al. A deep multiscale pyramid network enhanced with spatial–spectral residual attention for hyperspectral image change detection
CN112580480B (en) Hyperspectral remote sensing image classification method and device
CN113361495A (en) Face image similarity calculation method, device, equipment and storage medium
CN108960142B (en) Pedestrian re-identification method based on global feature loss function
Mshir et al. Signature recognition using machine learning
CN110852292B (en) Sketch face recognition method based on cross-modal multi-task depth measurement learning
CN107784284B (en) Face recognition method and system
CN115311502A (en) Remote sensing image small sample scene classification method based on multi-scale double-flow architecture
CN111105436B (en) Target tracking method, computer device and storage medium
CN114882537A (en) Finger new visual angle image generation method based on nerve radiation field
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN113763417B (en) Target tracking method based on twin network and residual error structure
CN111582057B (en) Face verification method based on local receptive field
CN112418262A (en) Vehicle re-identification method, client and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant