CN111709313A - Pedestrian re-identification method based on local and channel combination characteristics - Google Patents

Pedestrian re-identification method based on local and channel combination characteristics

Info

Publication number
CN111709313A
Authority
CN
China
Prior art keywords
pedestrian
picture
pictures
network
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010460902.9A
Other languages
Chinese (zh)
Other versions
CN111709313B (en)
Inventor
徐尔立
翁立
王建中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202010460902.9A priority Critical patent/CN111709313B/en
Publication of CN111709313A publication Critical patent/CN111709313A/en
Application granted granted Critical
Publication of CN111709313B publication Critical patent/CN111709313B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pedestrian re-identification method based on local and channel combined features. The invention simulates various occlusion situations through data augmentation, improving robustness to the occlusion problem. A Spatial Transformer Network (STN) simultaneously scales, rotates and translates the picture to align the pedestrian picture. The picture is divided horizontally to obtain features of different body parts. For the global feature of the whole pedestrian picture, a classification loss correctly classifies the pedestrian identity, while a similarity loss draws the features of the same pedestrian closer together and pushes the features of different pedestrians further apart. For the local and channel combined features, a similarity loss compares the different patterns on these different body parts. Finally, the two kinds of features are fused as the pedestrian descriptor, further improving its discriminative power. By improving the occlusion resistance and discriminative power of the pedestrian descriptor, pedestrian re-identification becomes more accurate.

Description

Pedestrian re-identification method based on local and channel combination characteristics
Technical Field
The invention belongs to the fields of computer vision and image retrieval, and relates to a pedestrian re-identification method based on local and channel combined features that addresses several common problems in pedestrian re-identification.
Background
With the development and popularization of surveillance systems, an ever-growing amount of pedestrian image data urgently needs to be processed. Pedestrian re-identification is the task of retrieving, from the images captured by other cameras, the images of a pedestrian who appears in an image captured by a given camera. It has wide application in real life, such as intelligent security, criminal investigation and human-computer interaction, and is closely related to other fields such as pedestrian detection and pedestrian tracking.
The pedestrian re-identification methods commonly used at present are based on Convolutional Neural Networks (CNNs). Some approaches aim to design or refine network models to extract more discriminative pedestrian image features, for example a residual network ResNet-50 pre-trained on the ImageNet dataset and fine-tuned on pedestrian re-identification datasets. Other methods work on improving or designing loss functions, which mainly fall into two categories: 1) classification losses, which treat each pedestrian as a particular class, such as the cross-entropy loss; 2) similarity losses, which constrain the similarity relationship between pedestrian images, such as the contrastive loss, triplet loss and quadruplet loss.
Disclosure of Invention
Aiming at the problems in the existing pedestrian re-identification field, the invention provides a pedestrian re-identification method based on local and channel combined features. The method has the following advantages: 1) the network model improves its resistance to the occlusion problem through data augmentation; 2) the misalignment of pedestrian images is addressed through a Spatial Transformer Network (STN); 3) more discriminative local and channel combined features are obtained by cutting the feature map and grouping its channels; 4) applying different loss functions to different features further improves their discriminative power. The proposed method comprehensively addresses the main problems in pedestrian re-identification, namely occlusion, misalignment and large variations in pedestrian appearance, and therefore has more accurate recognition capability.
A pedestrian re-identification method based on local and channel combination features comprises the following procedures:
First, the training process: the neural network is trained to obtain the best network parameters. A sample in the training dataset consists of a pedestrian picture x and its corresponding pedestrian identity ID(x), with ID(x) ∈ {1, ..., C}, where C is the total number of pedestrian identities; a pedestrian of one identity has several pictures. The specific steps are as follows:
step 1, sampling samples in a training set to generate small-batch data:
A mini-batch contains P × K pictures, namely P pedestrians with different identities and K pictures of each pedestrian. If a pedestrian has more than K pictures in the training set, K pictures are randomly sampled; if fewer than K, all pictures are sampled and the remainder is filled by repeated sampling.
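As an illustration of this sampling scheme, a minimal Python sketch of building one P × K mini-batch is given below; the dataset layout (a list of (image, identity) pairs), the function name and the use of Python's random module are assumptions for illustration, not part of the claimed method.

```python
import random
from collections import defaultdict

def sample_pk_batch(dataset, P, K):
    """Sample a mini-batch of P identities with K pictures each.

    `dataset` is assumed to be a list of (image, person_id) pairs; identities
    with fewer than K pictures are padded by repeated sampling."""
    by_id = defaultdict(list)
    for img, pid in dataset:
        by_id[pid].append(img)

    batch = []
    for pid in random.sample(list(by_id), P):          # P distinct identities
        imgs = by_id[pid]
        if len(imgs) >= K:
            chosen = random.sample(imgs, K)            # enough pictures: K without replacement
        else:
            chosen = imgs + random.choices(imgs, k=K - len(imgs))  # pad by re-sampling
        batch.extend((img, pid) for img in chosen)
    return batch                                       # P*K (image, identity) pairs
```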
Step 2, improving the anti-occlusion capability of the model through data augmentation:
2-1, generating a picture pool (Pool) capable of storing picture blocks of different resolutions;
2-2, before each picture is input into the network, copying, with probability p1, one small picture from it into the Pool; assuming the resolution of the picture is H × W, the resolution of the small picture, i.e. the picture block, falls randomly in the interval [0.1H, 0.2H] × [0.1W, 0.2W], and its location is also randomly selected;
2-3, then, with probability p2, randomly selecting a picture block from the Pool to cover the picture, the covered position being randomly selected.
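A minimal Python sketch of this pool-based occlusion augmentation follows; the class name, the default values of p1 and p2, and the assumption that pictures are numpy-style H × W × C arrays of equal size are illustrative only.

```python
import random

class PatchPoolOcclusion:
    """Simulate occlusion by exchanging small picture blocks through a shared pool (step 2).

    p1: probability of copying a random block of the current picture into the pool;
    p2: probability of covering the current picture with a random pooled block."""
    def __init__(self, p1=0.5, p2=0.5):
        self.pool, self.p1, self.p2 = [], p1, p2

    def __call__(self, img):                       # img: H x W x C array
        H, W = img.shape[:2]
        if random.random() < self.p1:
            ph = random.randint(int(0.1 * H), int(0.2 * H))   # block height in [0.1H, 0.2H]
            pw = random.randint(int(0.1 * W), int(0.2 * W))   # block width in [0.1W, 0.2W]
            y, x = random.randint(0, H - ph), random.randint(0, W - pw)
            self.pool.append(img[y:y + ph, x:x + pw].copy())
        if self.pool and random.random() < self.p2:
            block = random.choice(self.pool)
            ph, pw = block.shape[:2]
            y, x = random.randint(0, H - ph), random.randint(0, W - pw)
            img = img.copy()
            img[y:y + ph, x:x + pw] = block                   # cover a random position
        return img
```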
Step 3, loading a pre-training network:
The ResNet-50 network pre-trained on the ImageNet dataset is used; the structure before its Global Average Pooling (GAP) layer is preserved, and the stride of the last convolutional layer is set to 1. The result is denoted the "convolution base network". After a picture with a resolution of 256 × 128 is input into the convolution base network, a tensor feature map T of size 16 × 8 × 2048 is output.
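One possible way to build such a convolution base network with PyTorch/torchvision is sketched below; the layer indexing assumes torchvision's ResNet-50 layout and is an illustration rather than the exact construction used by the inventors.

```python
import torch
import torchvision

def build_conv_base():
    """ResNet-50 pretrained on ImageNet, truncated before global average pooling,
    with the stride of the last stage set to 1 so a 256 x 128 input yields a
    16 x 8 x 2048 feature map (channels-first in PyTorch: 2048 x 16 x 8)."""
    # torchvision >= 0.13 uses weights=...; older versions use pretrained=True
    resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    # set stride 1 in the first bottleneck of layer4 (3x3 conv and downsample path)
    resnet.layer4[0].conv2.stride = (1, 1)
    resnet.layer4[0].downsample[0].stride = (1, 1)
    return torch.nn.Sequential(*list(resnet.children())[:-2])   # drop GAP and FC

base = build_conv_base()
feat = base(torch.randn(1, 3, 256, 128))   # feat.shape == (1, 2048, 16, 8)
```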
Step 4, grouping the channels to obtain the features of each group of channels:
The tensor feature map T of size 16 × 8 × 2048 obtained in step 3 is divided equally into 4 groups along the channel (i.e. the last) dimension. Each group is a tensor feature map of size 16 × 8 × 512, denoted T1, T2, T3, T4.
Step 5, cutting the tensor feature map to obtain local features:
Each group tensor feature map T1, T2, T3, T4 obtained in step 4 is divided equally into 4 local tensor feature maps along the horizontal direction. Each local tensor feature map has size 4 × 8 × 512; they are denoted T11~T14, T21~T24, T31~T34, T41~T44. Steps 4 and 5 thus yield 16 local tensor feature maps T11~T14, T21~T24, T31~T34, T41~T44. Each local tensor feature map represents the combined features of a particular location and a particular group of channels.
Step 6, compressing the feature map:
The tensor feature map T is convolved with 512 convolution kernels of size 16 × 8 × 512 (parameters randomly initialized), yielding a global feature g of size 1 × 1 × 512. Similarly, T11~T14, T21~T24, T31~T34, T41~T44 are each convolved, the convolution kernel size corresponding to each local tensor feature map being 4 × 8 × 512 and the number being 512 (parameters randomly initialized), yielding 16 local and channel combined features pc1~pc16, each of size 1 × 1 × 512.
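Steps 4-6 can be sketched as a single PyTorch module as follows; the module and variable names are illustrative, and the compression convolutions are interpreted here as kernels covering the full spatial extent of the grouped and cut maps with 512 output channels.

```python
import torch
import torch.nn as nn

class LocalChannelHead(nn.Module):
    """Split the 16 x 8 x 2048 map into 4 channel groups and 4 horizontal stripes,
    then compress the full map and each 4 x 8 x 512 part to a 512-d vector."""
    def __init__(self, in_ch=2048, groups=4, parts=4, out_dim=512):
        super().__init__()
        self.groups, self.parts = groups, parts
        gc = in_ch // groups                                     # 512 channels per group
        self.local_convs = nn.ModuleList(                        # one conv per (group, stripe)
            [nn.Conv2d(gc, out_dim, kernel_size=(4, 8)) for _ in range(groups * parts)])
        self.global_conv = nn.Conv2d(in_ch, out_dim, kernel_size=(16, 8))

    def forward(self, T):                                        # T: (B, 2048, 16, 8)
        g = self.global_conv(T).flatten(1)                       # global feature g, (B, 512)
        pcs = []
        for gi, Tg in enumerate(T.chunk(self.groups, dim=1)):    # 4 channel groups
            for pi, Tgp in enumerate(Tg.chunk(self.parts, dim=2)):   # 4 horizontal stripes
                pcs.append(self.local_convs[gi * self.parts + pi](Tgp).flatten(1))
        return g, pcs                                            # g plus pc1~pc16, each (B, 512)
```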
Step 7, applying different loss functions to different characteristics:
For the local and channel combined features pc1~pc16, the Batch Hard Triplet Loss is applied separately:

$$L_{BH}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(f_\theta(x_a^i),f_\theta(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(f_\theta(x_a^i),f_\theta(x_n^j)\big)\Big]_{+}\qquad(1)$$

In formula (1), X denotes the mini-batch sampled in step 1 and θ denotes the parameters of the network. $x_a^i$ denotes the a-th of the K pictures of the i-th pedestrian and $x_p^i$ denotes the p-th of the K pictures of the i-th pedestrian; because the two pictures belong to the same pedestrian, they are called a positive sample pair. $x_n^j$ denotes the n-th of the K pictures of the j-th pedestrian; because $x_a^i$ and $x_n^j$ belong to different pedestrians, they are called a negative sample pair. $f_\theta(x)$ denotes the feature that picture x yields after passing through the network, and $D(x,y)$ denotes the Euclidean distance between features x and y. m is a constant constraining the relationship between the distances of the two feature pairs, and $[x]_+=\max(0,x)$. For a pedestrian picture $x_a^i$, the loss traverses each of that pedestrian's K pictures $x_p^i$ to find the particular $x_p^i$ whose feature is at the largest Euclidean distance from the feature of $x_a^i$ after the network; $(x_a^i,x_p^i)$ is the hard positive sample pair. It also traverses every picture $x_n^j$ of the other pedestrians ((P-1) × K pictures in total) to find the particular $x_n^j$ whose feature is at the smallest Euclidean distance from the feature of $x_a^i$ after the network; $(x_a^i,x_n^j)$ is the hard negative sample pair. The loss function thus finds the hard positive and hard negative sample pairs corresponding to each picture of each pedestrian and constrains the relationship between the hard positive pair feature distance and the hard negative pair feature distance.
For the feature pc1, the Batch Hard Triplet Loss is:

$$L_{BH}^{pc_1}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(pc_1(x_a^i),pc_1(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(pc_1(x_a^i),pc_1(x_n^j)\big)\Big]_{+}\qquad(2)$$

In formula (2), $pc_1(x_a^i)$ denotes the feature pc1 extracted from the a-th picture of the i-th pedestrian, $pc_1(x_p^i)$ denotes the feature pc1 extracted from the p-th picture of the i-th pedestrian, and $pc_1(x_n^j)$ denotes the feature pc1 extracted from the n-th picture of the j-th pedestrian.
For the global feature g, the Batch Hard Triplet Loss and the Softmax Loss are applied respectively. The Batch Hard Triplet Loss is:

$$L_{BH}^{g}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(g(x_a^i),g(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(g(x_a^i),g(x_n^j)\big)\Big]_{+}\qquad(3)$$

In formula (3), $g(x_a^i)$ denotes the feature g extracted from the a-th picture of the i-th pedestrian, $g(x_p^i)$ denotes the feature g extracted from the p-th picture of the i-th pedestrian, and $g(x_n^j)$ denotes the feature g extracted from the n-th picture of the j-th pedestrian. Before applying the Softmax Loss, g must be fed into a Fully Connected Layer (FC layer). The number of output neurons of the FC layer equals the total number C of pedestrian identities in the training set, and the parameters of the FC layer are randomly initialized. The Softmax Loss of the global feature g is:

$$L_{Softmax}(\theta;X)=-\sum_{i=1}^{P}\sum_{a=1}^{K}\log\frac{\exp\big(W_{y(x_a^i)}^{\top}\,g(x_a^i)\big)}{\sum_{k=1}^{C}\exp\big(W_k^{\top}\,g(x_a^i)\big)}\qquad(4)$$

In formula (4), $g(x_a^i)$ denotes the feature g extracted from the a-th picture of the i-th pedestrian, $y(x_a^i)$ denotes the pedestrian identity corresponding to that picture, $W_{y(x_a^i)}$ denotes the weight of the FC layer output neuron corresponding to that identity, and $W_k$ denotes the weight of the k-th output neuron of the FC layer.

The overall loss function of the network is:

$$Loss=\lambda_1 L_{BH}^{g}+\lambda_2 L_{Softmax}+\lambda_3\sum_{t=1}^{16}L_{BH}^{pc_t}\qquad(5)$$

In formula (5), λ1, λ2, λ3 are the weights of the three losses and satisfy λ1 + λ2 + λ3 = 1.
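A compact PyTorch sketch of the Batch Hard Triplet Loss of formula (1) follows; the margin value is an assumed hyperparameter, and the embeddings are assumed to arrive as one (P·K) × D matrix with one identity label per row. The same function can be applied to g and to each pc feature, while the Softmax Loss corresponds to a fully connected layer followed by cross-entropy.

```python
import torch

def batch_hard_triplet_loss(features, pids, margin=0.3):
    """Batch Hard triplet loss: for every anchor, take the hardest positive
    (largest distance, same identity) and hardest negative (smallest distance,
    different identity), then apply the margin hinge of formula (1)."""
    dist = torch.cdist(features, features)             # pairwise Euclidean distances, (N, N)
    same_id = pids.unsqueeze(0) == pids.unsqueeze(1)   # (N, N) boolean identity mask
    hardest_pos = (dist * same_id.float()).max(dim=1).values                  # hardest positive
    hardest_neg = dist.masked_fill(same_id, float('inf')).min(dim=1).values   # hardest negative
    return torch.clamp(margin + hardest_pos - hardest_neg, min=0).sum()       # [.]_+ and sum
```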
Step 8, denoting the network constructed in steps 3-6 as N; using a gradient descent algorithm, the loss function Loss in step 7 is differentiated and the learnable parameters in N are optimized through back propagation.
Step 9, aligning the feature map using a spatial transformer network:
9-1. The feature map F4 (a three-dimensional tensor) output by the 4th block (Res4 Block) of the convolution base network in N is passed through a residual connection block (Res Block, parameters randomly initialized) and a GAP layer to obtain a vector θ = (θ11, θ12, θ13, θ21, θ22, θ23) of length 6. θ11, θ12, θ21, θ22 are used to scale and rotate the feature map, and θ13, θ23 are used to translate it.
9-2. Using θ11, θ12, θ13, θ21, θ22, θ23, the feature map F2 (a tensor of size H × W × C) output by the 2nd block (Res2 Block) of the convolution base network in N is affine transformed to obtain a blank feature map F″2. For the feature map of channel c of F2 (an H × W tensor), a pixel at coordinates $(x_s, y_s)$ is mapped by the affine transformation to $(x_t, y_t)$; the relationship between the two is:

$$\begin{pmatrix}x_t\\ y_t\end{pmatrix}=\begin{pmatrix}\theta_{11} & \theta_{12} & \theta_{13}\\ \theta_{21} & \theta_{22} & \theta_{23}\end{pmatrix}\begin{pmatrix}x_s\\ y_s\\ 1\end{pmatrix}\qquad(6)$$

9-3. According to formula (6), the blank feature map F″2 is filled with pixels sampled from F2, yielding the aligned feature map F″2. During the affine transformation, when the F2 coordinate corresponding to a coordinate in F″2 falls outside the original range of F2, the pixel value at that coordinate is set to 0. When the F2 coordinate corresponding to a coordinate in F″2 is not an integer pixel location, the pixel value is filled by bilinear interpolation:

$$F_2''^{\,c}(m,n)=\sum_{h=1}^{H}\sum_{w=1}^{W}F_2^{\,c}(h,w)\,\max\big(0,1-|x_s-w|\big)\,\max\big(0,1-|y_s-h|\big)\qquad(7)$$

In formula (7), $F_2''^{\,c}(m,n)$ is the pixel value at position (m, n) on channel c of F″2, $F_2^{\,c}(h,w)$ is the pixel value at position (h, w) on channel c of F2, and $(x_s, y_s)$ is the F2 coordinate corresponding to position (m, n) of F″2.
Step 10, processing the aligned feature map:
The aligned feature map F″2 is input into a new convolutional network, which is formed by stacking the Res3 Block, Res4 Block and Res5 Block of a ResNet-50 network pre-trained on the ImageNet dataset, and which outputs a feature map T_align with the same size as the feature map T in step 3. The same operations as in steps 3-6 are applied to T_align, yielding 1 global feature g_align and 16 local and channel combined features pc1_align~pc16_align. The network constructed in steps 9-10 is denoted N_align; N_align consists of the Res1 Block, Res2 Block, Res3 Block, Res4 Block and STN of the convolution base network in N, the Res3 Block, Res4 Block and Res5 Block of the new convolutional network, and the convolutional layers that compress the global feature and the local and channel combined features. For the global feature g_align and the local and channel combined features pc1_align~pc16_align, the same loss functions as in step 7 are used to optimize the learnable parameters of N_align.
II, a test flow:
The test dataset is divided into a query set and a warehouse set (i.e. a gallery). The query set contains pedestrian pictures with known identities, and the warehouse set contains pictures with the same identities as the pedestrians in the query set as well as pictures with different identities. The dataset is constructed by capturing pedestrian pictures with surveillance cameras whose views do not overlap, automatically marking pedestrian bounding boxes with a pedestrian detector (DPM), and finally keeping the pedestrian pictures inside the bounding boxes and adding pedestrian identity labels; pictures of the same pedestrian in the query set and the warehouse set have different shooting viewpoints. The specific steps are as follows:
Step 1, inputting the pedestrian picture to be queried into N_align and concatenating the outputs g_align and pc1_align~pc16_align to obtain the descriptor of the pedestrian, which is an 8704-dimensional feature vector.
Step 2, similarly obtaining the pedestrian descriptors of all pictures in the warehouse set through step 1.
Step 3, calculating and storing the cosine distances between the descriptor of the pedestrian to be queried and each pedestrian descriptor in the warehouse set.
Step 4, sorting the stored distances from small to large and selecting the warehouse pedestrian pictures corresponding to the smallest k distances as the re-identification result for the pedestrian to be queried.
Step 5, measuring the recognition performance of the model by judging whether the true identities of the re-identified warehouse pedestrian pictures are consistent with the identity of the pedestrian to be queried.
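The retrieval of steps 3-4 can be sketched in a few lines of NumPy; the function name and the value of k are illustrative.

```python
import numpy as np

def rank_gallery(query_desc, gallery_descs, k=10):
    """Rank warehouse (gallery) descriptors by cosine distance to the query descriptor
    and return the indices and distances of the k closest ones. Descriptors are the
    8704-d vectors formed by concatenating g_align and pc1_align~pc16_align."""
    q = query_desc / np.linalg.norm(query_desc)
    g = gallery_descs / np.linalg.norm(gallery_descs, axis=1, keepdims=True)
    cos_dist = 1.0 - g @ q                  # cosine distance to every warehouse descriptor
    order = np.argsort(cos_dist)            # ascending: smallest distance = best match
    return order[:k], cos_dist[order[:k]]
```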
The invention has the following beneficial effects:
The invention simulates various occlusion situations through data augmentation; by processing artificially occluded pictures, the network improves its robustness to the occlusion problem. The STN simultaneously scales, rotates and translates the picture to align the pedestrian picture. Once the pictures are aligned, simply splitting them horizontally locates the different body parts of the pedestrian well and yields features of the different parts (cutting, channel grouping and affine transformation applied to the feature map are equivalent to the same operations applied to the original picture). Different channels of the feature map respond to different patterns (color, clothing type, gender, age, etc.), so the local and channel combined features can better locate the different patterns on different body parts of a pedestrian. For the global feature of the whole pedestrian picture, a classification loss correctly classifies the pedestrian identity, while a similarity loss draws the features of the same pedestrian closer together and pushes the features of different pedestrians further apart. For the local and channel combined features, a classification loss is not suitable because they carry little information and cannot correctly classify the pedestrian identity on their own; instead, comparing the different patterns on the different body parts through a similarity loss lets the model distinguish these patterns better and makes the local and channel combined features more discriminative. Finally, the two kinds of features are fused as the pedestrian descriptor, further improving its discriminative power. By improving the occlusion resistance and discriminative power of the pedestrian descriptor, pedestrian re-identification becomes more accurate.
Drawings
FIG. 1 is a flow chart of the training of the present invention;
FIG. 2 is an exemplary diagram of data enhancement;
FIG. 3 is a network constructed by training steps 3-6;
FIG. 4 is a network constructed by training steps 9-10;
FIG. 5 is a flow chart of the test of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The training flow of the pedestrian re-identification method based on local and channel combined features is shown in FIG. 1. Data augmentation is first applied to the batch of training samples; the augmented samples are input into the convolution base network, which outputs a feature map. Two different operations are applied to the feature map: the first compresses the feature map to obtain the global feature; the second groups the channels and cuts horizontally to generate sub-feature maps, which are then compressed to obtain the local and channel combined features. Different loss functions are applied to the global feature and the local and channel combined features, the total loss function is differentiated, and the network is optimized with the back propagation algorithm. The feature map output by the Res2 Block of the optimized network is aligned by the STN, and the aligned feature map is input into a new convolutional network to obtain an output feature map. The aligned global feature and local and channel combined features are obtained from this feature map in the same way as above, and the same loss functions are applied to optimize the new network again.
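Putting the pieces together, one optimization step of network N could look like the following sketch; it reuses the hypothetical helpers sketched earlier (build_conv_base, LocalChannelHead, batch_hard_triplet_loss), and the loss weights λ1, λ2, λ3, the margin and the classifier layer are assumptions, not values prescribed by the patent.

```python
import torch
import torch.nn.functional as F

def training_step(images, pids, base, head, classifier, optimizer,
                  lambdas=(0.4, 0.3, 0.3), margin=0.3):
    """One step of optimizing network N: forward pass, weighted loss of formula (5),
    back propagation and parameter update. `classifier` is an nn.Linear(512, C);
    batch_hard_triplet_loss is the function from the earlier sketch."""
    T = base(images)                           # (B, 2048, 16, 8) feature map from the conv base
    g, pcs = head(T)                           # global feature + 16 local-channel features
    l1, l2, l3 = lambdas
    loss = l1 * batch_hard_triplet_loss(g, pids, margin)            # triplet loss on g
    loss = loss + l2 * F.cross_entropy(classifier(g), pids)         # softmax loss on g
    loss = loss + l3 * sum(batch_hard_triplet_loss(pc, pids, margin) for pc in pcs)
    optimizer.zero_grad()
    loss.backward()                            # back-propagate and update the parameters of N
    optimizer.step()
    return loss.item()
```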
The method comprises the following specific steps:
step 1, sampling samples in a training set to generate small-batch data:
A mini-batch contains P × K pictures, namely P pedestrians with different identities and K pictures of each pedestrian. If a pedestrian has more than K pictures in the training set, K pictures are randomly sampled; if fewer than K, all pictures are sampled and the remainder is filled by repeated sampling.
Step 2, improving the anti-occlusion capability of the model through the data augmentation shown in FIG. 2:
2-1, generating a picture pool (Pool) capable of storing picture blocks of different resolutions;
2-2, before each picture is input into the network, copying, with probability p1, one small picture from it into the Pool; assuming the resolution of the picture is H × W, the resolution of the small picture, i.e. the picture block, falls randomly in the interval [0.1H, 0.2H] × [0.1W, 0.2W], and its location is also randomly selected;
2-3, then, with probability p2, randomly selecting a picture block from the Pool to cover the picture, the covered position being randomly selected.
Step 3, loading a pre-training network:
The ResNet-50 network pre-trained on the ImageNet dataset is used; the structure before its Global Average Pooling (GAP) layer is preserved, and the stride of the last convolutional layer is set to 1. The result is denoted the convolution base network. A picture with a resolution of 256 × 128 is input into the convolution base network, which outputs a tensor feature map T of size 16 × 8 × 2048.
Step 4, grouping the channels to obtain the features of each group of channels:
The tensor feature map T of size 16 × 8 × 2048 obtained in step 3 is divided equally into 4 groups along the channel dimension. Each group is a tensor feature map of size 16 × 8 × 512, denoted T1, T2, T3, T4.
Step 5, cutting the tensor feature map to obtain local features:
Each group tensor feature map T1, T2, T3, T4 obtained in step 4 is divided equally into 4 local tensor feature maps along the horizontal direction. Each local tensor feature map has size 4 × 8 × 512; they are denoted T11~T14, T21~T24, T31~T34, T41~T44. Steps 4 and 5 thus yield 16 local tensor feature maps T11~T14, T21~T24, T31~T34, T41~T44. Each local tensor feature map represents the combined features of a particular location and a particular group of channels.
Step 6, compressing the feature map:
The tensor feature map T is convolved with 512 convolution kernels of size 16 × 8 × 512 (parameters randomly initialized), yielding a global feature g of size 1 × 1 × 512. Similarly, T11~T14, T21~T24, T31~T34, T41~T44 are each convolved, the convolution kernel size corresponding to each local tensor feature map being 4 × 8 × 512 and the number being 512 (parameters randomly initialized), yielding 16 local and channel combined features pc1~pc16, each of size 1 × 1 × 512. The network N constructed in steps 3-6 is shown in FIG. 3.
Step 7, applying different loss functions to different characteristics:
For the local and channel combined features pc1~pc16, the Batch Hard Triplet Loss is applied separately.
For example, for the feature pc1 the Batch Hard Triplet Loss is:

$$L_{BH}^{pc_1}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(pc_1(x_a^i),pc_1(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(pc_1(x_a^i),pc_1(x_n^j)\big)\Big]_{+}\qquad(8)$$

For the global feature g, the Batch Hard Triplet Loss and the Softmax Loss are applied respectively. The Batch Hard Triplet Loss is:

$$L_{BH}^{g}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(g(x_a^i),g(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(g(x_a^i),g(x_n^j)\big)\Big]_{+}\qquad(9)$$

The Softmax Loss is:

$$L_{Softmax}(\theta;X)=-\sum_{i=1}^{P}\sum_{a=1}^{K}\log\frac{\exp\big(W_{y(x_a^i)}^{\top}\,g(x_a^i)\big)}{\sum_{k=1}^{C}\exp\big(W_k^{\top}\,g(x_a^i)\big)}\qquad(10)$$

The overall loss function of the network is:

$$Loss=\lambda_1 L_{BH}^{g}+\lambda_2 L_{Softmax}+\lambda_3\sum_{t=1}^{16}L_{BH}^{pc_t}\qquad(11)$$
Step 8, using a gradient descent algorithm to differentiate the loss function Loss in step 7 and back-propagate it to optimize the learnable parameters in N.
Step 9, aligning the feature map using a spatial transformer network:
9-1. The feature map F4 (a three-dimensional tensor) output by the 4th block (Res4 Block) of the convolution base network in N is passed through a residual connection block (Res Block, parameters randomly initialized) and a GAP layer to obtain a vector θ = (θ11, θ12, θ13, θ21, θ22, θ23) of length 6. θ11, θ12, θ21, θ22 are used to scale and rotate the feature map, and θ13, θ23 are used to translate it.
9-2. Using θ11, θ12, θ13, θ21, θ22, θ23, the feature map F2 (a tensor of size H × W × C) output by the 2nd block (Res2 Block) of the convolution base network in N is affine transformed to obtain a blank feature map F″2. For the feature map of channel c of F2 (an H × W tensor), a pixel at coordinates $(x_s, y_s)$ is mapped by the affine transformation to $(x_t, y_t)$; the relationship between the two is:

$$\begin{pmatrix}x_t\\ y_t\end{pmatrix}=\begin{pmatrix}\theta_{11} & \theta_{12} & \theta_{13}\\ \theta_{21} & \theta_{22} & \theta_{23}\end{pmatrix}\begin{pmatrix}x_s\\ y_s\\ 1\end{pmatrix}\qquad(12)$$

9-3. According to formula (12), the blank feature map F″2 is filled with pixels sampled from F2, yielding the aligned feature map F″2. During the affine transformation, when the F2 coordinate corresponding to a coordinate in F″2 falls outside the original range of F2, the pixel value at that coordinate is set to 0. When the F2 coordinate corresponding to a coordinate in F″2 is not an integer pixel location, the pixel value is filled by bilinear interpolation:

$$F_2''^{\,c}(m,n)=\sum_{h=1}^{H}\sum_{w=1}^{W}F_2^{\,c}(h,w)\,\max\big(0,1-|x_s-w|\big)\,\max\big(0,1-|y_s-h|\big)\qquad(13)$$
Step 10, processing the aligned feature map:
The aligned feature map F″2 is input into a new convolutional network, which is formed by stacking the Res3 Block, Res4 Block and Res5 Block of a ResNet-50 network pre-trained on the ImageNet dataset, and which outputs a feature map T_align with the same size as the feature map T in step 3. The same operations as in steps 3-6 are applied to T_align, yielding 1 global feature g_align and 16 local and channel combined features pc1_align~pc16_align. The network constructed in steps 9-10 is denoted N_align; N_align consists of the Res1 Block, Res2 Block, Res3 Block, Res4 Block and STN of the convolution base network in N, the Res3 Block, Res4 Block and Res5 Block of the new convolutional network, and the convolutional layers that compress the global feature and the local and channel combined features; its specific structure is shown in FIG. 4. For the global feature g_align and the local and channel combined features pc1_align~pc16_align, the same loss functions as in step 7 are used to optimize the learnable parameters of N_align.
The test flow of the pedestrian re-identification method based on local and channel combined features is shown in FIG. 5. The pedestrian picture to be queried and all pedestrian pictures in the warehouse set are input into the trained network, which outputs their pedestrian descriptors. The cosine distances between the pedestrian descriptors are calculated, and the k warehouse pedestrian pictures with the smallest distances are selected as the re-identification result for the pedestrian picture to be queried. The quality of the model is judged by comparing whether the identities of the re-identified pedestrians are consistent with the identity of the pedestrian to be queried.
The method comprises the following specific steps:
step 1, inputting a pedestrian picture to be inquired into NalignG to be outputalignAnd
Figure BDA0002510875020000114
connected to obtain the descriptor of the pedestrian
Figure BDA0002510875020000115
Is a 8704 dimensional feature vector.
Step 2, similarly obtaining the pedestrian descriptors of all pictures in the warehouse set through step 1.
Step 3, calculating and storing the cosine distances between the descriptor of the pedestrian to be queried and each pedestrian descriptor in the warehouse set.
Step 4, sorting the stored distances from small to large and selecting the warehouse pedestrian pictures corresponding to the smallest k distances as the re-identification result for the pedestrian to be queried.

Claims (3)

1. A pedestrian re-identification method based on local and channel combination features is characterized by comprising the following procedures:
firstly, a training process: training a neural network to obtain optimal network parameters; a sample in the training data set consists of a pedestrian picture x and a corresponding pedestrian identity ID(x), where ID(x) ∈ {1, ..., C}; C represents the total number of pedestrian identities, and a pedestrian of one identity has several pictures;
II, a test flow:
the test data set is divided into an inquiry set and a warehouse set, wherein the inquiry set comprises the pedestrian pictures with known identities, and the warehouse set comprises the pictures with the same identities as the pedestrians in the inquiry set and the pictures with different identities from the pedestrians in the inquiry set; the data set is constructed by shooting pictures of pedestrians by monitoring cameras with non-overlapping view angles, automatically marking a rectangular frame of the pedestrians by a pedestrian Detector (DPM), and finally keeping the pictures of the pedestrians in the rectangular frame and adding identity tags of the pedestrians, wherein the pictures of the same pedestrian in the query set and the warehouse set have different shooting view angles.
2. The pedestrian re-identification method based on the local and channel combined features as claimed in claim 1, wherein the training process comprises the following steps:
step 1, sampling samples in a training set to generate small-batch data:
the small batch of data comprises P multiplied by K pictures, namely P pedestrians with different identities, wherein each pedestrian has K pictures; if the number of pictures of one pedestrian is more than K in the training set, randomly sampling K pictures; if the number of the pictures is less than K, sampling all the pictures, and repeatedly sampling when the number of the pictures is insufficient;
step 2, improving the anti-occlusion capability of the model through data augmentation:
2-1, generating a picture pool (Pool) capable of storing picture blocks of different resolutions;
2-2, before each picture is input into the network, copying, with probability p1, one small picture from it into the Pool; assuming the resolution of the picture is H × W, the resolution of the small picture, i.e. the picture block, falls randomly in the interval [0.1H, 0.2H] × [0.1W, 0.2W], and its location is also randomly selected;
2-3, then, with probability p2, randomly selecting a picture block from the Pool to cover the picture, the covered position being randomly selected;
step 3, loading a pre-training network:
using the ResNet-50 network pre-trained on the ImageNet dataset, preserving the structure before the Global Average Pooling (GAP) layer of this network, and setting the stride of the last convolutional layer to 1, the result being denoted the "convolution base network"; inputting a picture with a resolution of 256 × 128 into the convolution base network and outputting a tensor feature map T of size 16 × 8 × 2048;
step 4, grouping the channels to obtain the features of each group of channels:
dividing the tensor feature map T of size 16 × 8 × 2048 obtained in step 3 equally into 4 groups along the channel (namely the last) dimension, the tensor feature map of each group having size 16 × 8 × 512 and being respectively denoted T1, T2, T3, T4;
step 5, cutting the tensor feature map to obtain local features:
dividing each group tensor feature map T1, T2, T3, T4 obtained in step 4 equally into 4 local tensor feature maps along the horizontal direction, each local tensor feature map having size 4 × 8 × 512 and being respectively denoted T11~T14, T21~T24, T31~T34, T41~T44; obtaining 16 local tensor feature maps T11~T14, T21~T24, T31~T34, T41~T44 through steps 4 and 5; each local tensor feature map representing the combined features of a particular position and a particular group of channels;
step 6, compressing the feature map:
convolving the tensor feature map T with 512 convolution kernels of size 16 × 8 × 512, the parameters being randomly initialized, to obtain a global feature g of size 1 × 1 × 512; similarly, convolving T11~T14, T21~T24, T31~T34, T41~T44 respectively, the convolution kernel size corresponding to each local tensor feature map being 4 × 8 × 512, the number being 512 and the parameters being randomly initialized, to obtain 16 local and channel combined features pc1~pc16 of size 1 × 1 × 512;
Step 7, applying different loss functions to different characteristics:
for the local and channel combined features pc1~pc16, applying the Batch Hard Triplet Loss separately:

$$L_{BH}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(f_\theta(x_a^i),f_\theta(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(f_\theta(x_a^i),f_\theta(x_n^j)\big)\Big]_{+}\qquad(1)$$

in formula (1), X represents the mini-batch of data sampled in step 1, and θ represents the parameters of the network; $x_a^i$ represents the a-th of the K pictures of the i-th pedestrian, and $x_p^i$ represents the p-th of the K pictures of the i-th pedestrian, the two pictures belonging to the same pedestrian and being called a positive sample pair; $x_n^j$ represents the n-th of the K pictures of the j-th pedestrian; because $x_a^i$ and $x_n^j$ belong to different pedestrians, they are called a negative sample pair; $f_\theta(x)$ represents the feature output after picture x passes through the network, and $D(x,y)$ represents the Euclidean distance between features x and y; m is a constant that constrains the relationship between the distances of the two feature pairs, and $[x]_+=\max(0,x)$; for a pedestrian picture $x_a^i$, each of that pedestrian's K pictures $x_p^i$ is traversed to find the particular $x_p^i$ such that the Euclidean distance between the features of $x_a^i$ and $x_p^i$ obtained after the network is largest, $(x_a^i,x_p^i)$ being the hard positive sample pair; at the same time, each picture $x_n^j$ of the other pedestrians ((P-1) × K pictures in total) is traversed to find the particular $x_n^j$ such that the Euclidean distance between the features of $x_a^i$ and $x_n^j$ obtained after the network is smallest, $(x_a^i,x_n^j)$ being the hard negative sample pair; the loss function finds the hard positive and hard negative sample pairs corresponding to each picture of each pedestrian and constrains the relationship between the hard positive pair feature distance and the hard negative pair feature distance;
for the feature pc1, the Batch Hard Triplet Loss is:

$$L_{BH}^{pc_1}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(pc_1(x_a^i),pc_1(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(pc_1(x_a^i),pc_1(x_n^j)\big)\Big]_{+}\qquad(2)$$

in formula (2), $pc_1(x_a^i)$ represents the feature pc1 extracted from the a-th picture of the i-th pedestrian, $pc_1(x_p^i)$ represents the feature pc1 extracted from the p-th picture of the i-th pedestrian, and $pc_1(x_n^j)$ represents the feature pc1 extracted from the n-th picture of the j-th pedestrian;
for the global feature g, applying the Batch Hard Triplet Loss and the Softmax Loss respectively; the Batch Hard Triplet Loss is:

$$L_{BH}^{g}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(g(x_a^i),g(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(g(x_a^i),g(x_n^j)\big)\Big]_{+}\qquad(3)$$

in formula (3), $g(x_a^i)$ represents the feature g extracted from the a-th picture of the i-th pedestrian, $g(x_p^i)$ represents the feature g extracted from the p-th picture of the i-th pedestrian, and $g(x_n^j)$ represents the feature g extracted from the n-th picture of the j-th pedestrian; before applying the Softmax Loss, g needs to be input into a Fully Connected Layer (FC layer); the number of output neurons of the fully connected layer is the total number C of pedestrian identities in the training set, and the parameters of the fully connected layer are randomly initialized; the Softmax Loss of the global feature g is:

$$L_{Softmax}(\theta;X)=-\sum_{i=1}^{P}\sum_{a=1}^{K}\log\frac{\exp\big(W_{y(x_a^i)}^{\top}\,g(x_a^i)\big)}{\sum_{k=1}^{C}\exp\big(W_k^{\top}\,g(x_a^i)\big)}\qquad(4)$$

in formula (4), $g(x_a^i)$ represents the feature g extracted from the a-th picture of the i-th pedestrian, $y(x_a^i)$ represents the pedestrian identity corresponding to that picture, $W_{y(x_a^i)}$ represents the weight of the FC layer output neuron corresponding to that identity, and $W_k$ represents the weight corresponding to the k-th output neuron of the FC layer;

the overall loss function of the network is:

$$Loss=\lambda_1 L_{BH}^{g}+\lambda_2 L_{Softmax}+\lambda_3\sum_{t=1}^{16}L_{BH}^{pc_t}\qquad(5)$$

in formula (5), λ1, λ2, λ3 are the weights of the three losses and satisfy λ1 + λ2 + λ3 = 1;
step 8, denoting the network constructed in steps 3-6 as N; using a gradient descent algorithm, differentiating the loss function Loss in step 7 and optimizing the learnable parameters in N through back propagation;
step 9, aligning the feature map using a spatial transformer network:
9-1, passing the feature map F4 (a three-dimensional tensor) output by the 4th block (Res4 Block) of the convolution base network in N through a residual connection block (Res Block, parameters randomly initialized) and a GAP layer to obtain a vector θ = (θ11, θ12, θ13, θ21, θ22, θ23) of length 6; wherein θ11, θ12, θ21, θ22 are used for scaling and rotating the feature map, and θ13, θ23 are used for translating the feature map;
9-2, using θ11, θ12, θ13, θ21, θ22, θ23 to affine transform the feature map F2 (a tensor of size H × W × C) output by the 2nd block (Res2 Block) of the convolution base network in N, obtaining a blank feature map F″2; for the feature map of channel c of F2 (an H × W tensor), a pixel at coordinates $(x_s, y_s)$ becomes $(x_t, y_t)$ after the affine transformation, the relationship between the two being:

$$\begin{pmatrix}x_t\\ y_t\end{pmatrix}=\begin{pmatrix}\theta_{11} & \theta_{12} & \theta_{13}\\ \theta_{21} & \theta_{22} & \theta_{23}\end{pmatrix}\begin{pmatrix}x_s\\ y_s\\ 1\end{pmatrix}\qquad(6)$$

9-3, according to formula (6), filling the blank feature map F″2 with pixels sampled from F2 to obtain the aligned feature map F″2; during the affine transformation, when the F2 coordinate corresponding to a coordinate in F″2 exceeds the original range of F2, setting the pixel value at that coordinate to 0; when the F2 coordinate corresponding to a coordinate in F″2 is not an integer pixel location, filling the pixel value by bilinear interpolation:

$$F_2''^{\,c}(m,n)=\sum_{h=1}^{H}\sum_{w=1}^{W}F_2^{\,c}(h,w)\,\max\big(0,1-|x_s-w|\big)\,\max\big(0,1-|y_s-h|\big)\qquad(7)$$

in formula (7), $F_2''^{\,c}(m,n)$ is the pixel value at position (m, n) on channel c of F″2, $F_2^{\,c}(h,w)$ is the pixel value at position (h, w) on channel c of F2, and $(x_s, y_s)$ is the F2 coordinate corresponding to position (m, n) of F″2;
step 10, processing the aligned feature map:
inputting the aligned feature map F″2 into a new convolutional network, the new convolutional network being formed by stacking the Res3 Block, Res4 Block and Res5 Block of a ResNet-50 network pre-trained on the ImageNet dataset, and outputting a feature map T_align with the same size as the feature map T in step 3; performing the same operations as in steps 3-6 on T_align to obtain 1 global feature g_align and 16 local and channel combined features pc1_align~pc16_align; denoting the network constructed in steps 9-10 as N_align, N_align being composed of the Res1 Block, Res2 Block, Res3 Block, Res4 Block and STN of the convolution base network in N, the Res3 Block, Res4 Block and Res5 Block of the new convolutional network, and the convolutional layers for compressing the global feature and the local and channel combined features; for the global feature g_align and the local and channel combined features pc1_align~pc16_align, optimizing the learnable parameters of N_align using the same loss functions as in step 7.
3. The pedestrian re-identification method based on the local and channel combined features as claimed in claim 2, wherein the testing process comprises the following steps:
step 1, inputting the pedestrian picture to be queried into N_align and concatenating the outputs g_align and pc1_align~pc16_align to obtain the descriptor of the pedestrian, which is an 8704-dimensional feature vector;
step 2, obtaining pedestrian descriptors of all the pictures in the warehouse set through the step 1;
step 3, respectively calculating and storing cosine distances between the pedestrian descriptors to be inquired and each pedestrian descriptor in the warehouse set;
step 4, sorting the stored distances from small to large and selecting the warehouse pedestrian pictures corresponding to the smallest k distances as the re-identification result for the pedestrian to be queried;
step 5, measuring the recognition performance of the model by judging whether the true identities of the re-identified warehouse pedestrian pictures are consistent with the identity of the pedestrian to be queried.
CN202010460902.9A 2020-05-27 2020-05-27 Pedestrian re-identification method based on local and channel combination characteristics Active CN111709313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010460902.9A CN111709313B (en) 2020-05-27 2020-05-27 Pedestrian re-identification method based on local and channel combination characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010460902.9A CN111709313B (en) 2020-05-27 2020-05-27 Pedestrian re-identification method based on local and channel combination characteristics

Publications (2)

Publication Number Publication Date
CN111709313A true CN111709313A (en) 2020-09-25
CN111709313B CN111709313B (en) 2022-07-29

Family

ID=72537979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010460902.9A Active CN111709313B (en) 2020-05-27 2020-05-27 Pedestrian re-identification method based on local and channel combination characteristics

Country Status (1)

Country Link
CN (1) CN111709313B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009528A (en) * 2017-12-26 2018-05-08 广州广电运通金融电子股份有限公司 Face authentication method, device, computer equipment and storage medium based on Triplet Loss
CN108399362A (en) * 2018-01-24 2018-08-14 中山大学 A kind of rapid pedestrian detection method and device
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109583379A (en) * 2018-11-30 2019-04-05 常州大学 A kind of pedestrian's recognition methods again being aligned network based on selective erasing pedestrian
CN110543817A (en) * 2019-07-25 2019-12-06 北京大学 Pedestrian re-identification method based on posture guidance feature learning
CN110659573A (en) * 2019-08-22 2020-01-07 北京捷通华声科技股份有限公司 Face recognition method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEILIN ZHONG 等: "Combining multilevel feature extraction and multi-loss learning for person re-identification", 《NEUROCOMPUTING》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112232194A (en) * 2020-10-15 2021-01-15 广州云从凯风科技有限公司 Single-target human body key point detection method, system, equipment and medium
CN112686176A (en) * 2020-12-30 2021-04-20 深圳云天励飞技术股份有限公司 Target re-recognition method, model training method, device, equipment and storage medium
CN112686176B (en) * 2020-12-30 2024-05-07 深圳云天励飞技术股份有限公司 Target re-identification method, model training method, device, equipment and storage medium
CN113343909A (en) * 2021-06-29 2021-09-03 南京星云数字技术有限公司 Training method of multi-task classification network and pedestrian re-identification method
CN113343909B (en) * 2021-06-29 2023-09-26 南京星云数字技术有限公司 Training method of multi-task classification network and pedestrian re-recognition method
CN113255615A (en) * 2021-07-06 2021-08-13 南京视察者智能科技有限公司 Pedestrian retrieval method and device for self-supervision learning
CN114170516A (en) * 2021-12-09 2022-03-11 清华大学 Vehicle weight recognition method and device based on roadside perception and electronic equipment
CN114299535A (en) * 2021-12-09 2022-04-08 河北大学 Feature aggregation human body posture estimation method based on Transformer
CN114170516B (en) * 2021-12-09 2022-09-13 清华大学 Vehicle weight recognition method and device based on roadside perception and electronic equipment
CN114299535B (en) * 2021-12-09 2024-05-31 河北大学 Transformer-based feature aggregation human body posture estimation method

Also Published As

Publication number Publication date
CN111709313B (en) 2022-07-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant