CN111709313A - Pedestrian re-identification method based on local and channel combination characteristics - Google Patents

Pedestrian re-identification method based on local and channel combination characteristics

Info

Publication number
CN111709313A
Authority
CN
China
Prior art keywords
pedestrian
picture
pictures
network
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010460902.9A
Other languages
Chinese (zh)
Other versions
CN111709313B (en)
Inventor
徐尔立
翁立
王建中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202010460902.9A priority Critical patent/CN111709313B/en
Publication of CN111709313A publication Critical patent/CN111709313A/en
Application granted granted Critical
Publication of CN111709313B publication Critical patent/CN111709313B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pedestrian re-identification method based on local and channel combined features. The invention simulates various occlusion situations through data augmentation, improving robustness to the occlusion problem. A Spatial Transformer Network (STN) simultaneously scales, rotates and translates the picture to align the pedestrian picture. The picture is divided horizontally to obtain features of different body parts. For the global feature of the whole pedestrian picture, a classification loss correctly classifies the pedestrian identity, while a similarity loss draws the features of the same pedestrian closer together and pushes the features of different pedestrians further apart. For the local and channel combined features, a similarity loss compares the different patterns on these different body parts. Finally, the two kinds of features are fused as the pedestrian descriptor, further improving its discriminative power. By improving the occlusion resistance and discriminative power of the pedestrian descriptor, pedestrian re-identification becomes more accurate.

Description

Pedestrian re-identification method based on local and channel combination characteristics
Technical Field
The invention belongs to the fields of computer vision and image retrieval, and relates to a pedestrian re-identification method based on local and channel combined features that addresses several common problems in pedestrian re-identification.
Background
With the development and popularization of surveillance systems, an ever-growing amount of pedestrian image data urgently needs to be processed. Pedestrian re-identification is the task of retrieving, from the images captured by other cameras, the images of a pedestrian who appears in an image captured by a given camera. It has wide application in real life, such as intelligent security, criminal investigation and human-computer interaction, and is closely related to other fields such as pedestrian detection and pedestrian tracking.
The pedestrian re-identification methods commonly used at present are based on Convolutional Neural Networks (CNNs). Some approaches aim to design or refine network models to extract more discriminative pedestrian image features, for example a residual network ResNet-50 pre-trained on the ImageNet dataset and fine-tuned on pedestrian re-identification datasets. Other methods work on improving or designing loss functions, which mainly fall into two categories: 1) classification losses, which treat each pedestrian as a particular class, such as the cross-entropy loss; 2) similarity losses, which constrain the similarity relationship between pedestrian images, such as the contrastive loss, triplet loss and quadruplet loss.
Disclosure of Invention
Aiming at the problems in the existing pedestrian re-identification field, the invention provides a pedestrian re-identification method based on local and channel combined features. The method has the following advantages: 1) the network model improves its resistance to the occlusion problem through data augmentation; 2) the misalignment of pedestrian images is addressed through a Spatial Transformer Network (STN); 3) more discriminative local and channel combined features are obtained by cutting the feature map and grouping its channels; 4) applying different loss functions to different features further improves their discriminative power. The proposed method comprehensively addresses the main problems in pedestrian re-identification, namely occlusion, misalignment and large variations in pedestrian appearance, and therefore has more accurate recognition capability.
A pedestrian re-identification method based on local and channel combination features comprises the following procedures:
First, the training process: the neural network is trained to obtain the best network parameters. A sample in the training dataset consists of a pedestrian picture x and its corresponding pedestrian identity ID(x), with ID(x) ∈ {1, ..., C}, where C is the total number of pedestrian identities; a pedestrian of one identity has several pictures. The specific steps are as follows:
step 1, sampling samples in a training set to generate small-batch data:
A mini-batch contains P × K pictures, namely P pedestrians with different identities and K pictures of each pedestrian. If a pedestrian has more than K pictures in the training set, K pictures are randomly sampled; if fewer than K, all pictures are sampled and the remainder is filled by repeated sampling.
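As an illustration of this sampling scheme, a minimal Python sketch of building one P × K mini-batch is given below; the dataset layout (a list of (image, identity) pairs), the function name and the use of Python's random module are assumptions for illustration, not part of the claimed method.

```python
import random
from collections import defaultdict

def sample_pk_batch(dataset, P, K):
    """Sample a mini-batch of P identities with K pictures each.

    `dataset` is assumed to be a list of (image, person_id) pairs; identities
    with fewer than K pictures are padded by repeated sampling."""
    by_id = defaultdict(list)
    for img, pid in dataset:
        by_id[pid].append(img)

    batch = []
    for pid in random.sample(list(by_id), P):          # P distinct identities
        imgs = by_id[pid]
        if len(imgs) >= K:
            chosen = random.sample(imgs, K)            # enough pictures: K without replacement
        else:
            chosen = imgs + random.choices(imgs, k=K - len(imgs))  # pad by re-sampling
        batch.extend((img, pid) for img in chosen)
    return batch                                       # P*K (image, identity) pairs
```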
Step 2, improving the anti-occlusion capability of the model through data augmentation:
2-1, generating a picture pool (Pool) capable of storing picture blocks of different resolutions;
2-2, before each picture is input into the network, copying, with probability p1, one small picture from it into the Pool; assuming the resolution of the picture is H × W, the resolution of the small picture, i.e. the picture block, falls randomly in the interval [0.1H, 0.2H] × [0.1W, 0.2W], and its location is also randomly selected;
2-3, then, with probability p2, randomly selecting a picture block from the Pool to cover the picture, the covered position being randomly selected.
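A minimal Python sketch of this pool-based occlusion augmentation follows; the class name, the default values of p1 and p2, and the assumption that pictures are numpy-style H × W × C arrays of equal size are illustrative only.

```python
import random

class PatchPoolOcclusion:
    """Simulate occlusion by exchanging small picture blocks through a shared pool (step 2).

    p1: probability of copying a random block of the current picture into the pool;
    p2: probability of covering the current picture with a random pooled block."""
    def __init__(self, p1=0.5, p2=0.5):
        self.pool, self.p1, self.p2 = [], p1, p2

    def __call__(self, img):                       # img: H x W x C array
        H, W = img.shape[:2]
        if random.random() < self.p1:
            ph = random.randint(int(0.1 * H), int(0.2 * H))   # block height in [0.1H, 0.2H]
            pw = random.randint(int(0.1 * W), int(0.2 * W))   # block width in [0.1W, 0.2W]
            y, x = random.randint(0, H - ph), random.randint(0, W - pw)
            self.pool.append(img[y:y + ph, x:x + pw].copy())
        if self.pool and random.random() < self.p2:
            block = random.choice(self.pool)
            ph, pw = block.shape[:2]
            y, x = random.randint(0, H - ph), random.randint(0, W - pw)
            img = img.copy()
            img[y:y + ph, x:x + pw] = block                   # cover a random position
        return img
```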
Step 3, loading a pre-training network:
The ResNet-50 network pre-trained on the ImageNet dataset is used; the structure before its Global Average Pooling (GAP) layer is preserved, and the stride of the last convolutional layer is set to 1. The result is denoted the "convolution base network". After a picture with a resolution of 256 × 128 is input into the convolution base network, a tensor feature map T of size 16 × 8 × 2048 is output.
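One possible way to build such a convolution base network with PyTorch/torchvision is sketched below; the layer indexing assumes torchvision's ResNet-50 layout and is an illustration rather than the exact construction used by the inventors.

```python
import torch
import torchvision

def build_conv_base():
    """ResNet-50 pretrained on ImageNet, truncated before global average pooling,
    with the stride of the last stage set to 1 so a 256 x 128 input yields a
    16 x 8 x 2048 feature map (channels-first in PyTorch: 2048 x 16 x 8)."""
    # torchvision >= 0.13 uses weights=...; older versions use pretrained=True
    resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    # set stride 1 in the first bottleneck of layer4 (3x3 conv and downsample path)
    resnet.layer4[0].conv2.stride = (1, 1)
    resnet.layer4[0].downsample[0].stride = (1, 1)
    return torch.nn.Sequential(*list(resnet.children())[:-2])   # drop GAP and FC

base = build_conv_base()
feat = base(torch.randn(1, 3, 256, 128))   # feat.shape == (1, 2048, 16, 8)
```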
Step 4, grouping the channels to obtain the features of each group of channels:
The tensor feature map T of size 16 × 8 × 2048 obtained in step 3 is divided equally into 4 groups along the channel (i.e. the last) dimension. Each group is a tensor feature map of size 16 × 8 × 512, denoted T1, T2, T3, T4.
Step 5, cutting the tensor feature map to obtain local features:
Each group tensor feature map T1, T2, T3, T4 obtained in step 4 is divided equally into 4 local tensor feature maps along the horizontal direction. Each local tensor feature map has size 4 × 8 × 512; they are denoted T11~T14, T21~T24, T31~T34, T41~T44. Steps 4 and 5 thus yield 16 local tensor feature maps T11~T14, T21~T24, T31~T34, T41~T44. Each local tensor feature map represents the combined features of a particular location and a particular group of channels.
Step 6, compressing the feature map:
The tensor feature map T is convolved with 512 convolution kernels of size 16 × 8 × 512 (parameters randomly initialized), yielding a global feature g of size 1 × 1 × 512. Similarly, T11~T14, T21~T24, T31~T34, T41~T44 are each convolved, the convolution kernel size corresponding to each local tensor feature map being 4 × 8 × 512 and the number being 512 (parameters randomly initialized), yielding 16 local and channel combined features pc1~pc16, each of size 1 × 1 × 512.
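Steps 4-6 can be sketched as a single PyTorch module as follows; the module and variable names are illustrative, and the compression convolutions are interpreted here as kernels covering the full spatial extent of the grouped and cut maps with 512 output channels.

```python
import torch
import torch.nn as nn

class LocalChannelHead(nn.Module):
    """Split the 16 x 8 x 2048 map into 4 channel groups and 4 horizontal stripes,
    then compress the full map and each 4 x 8 x 512 part to a 512-d vector."""
    def __init__(self, in_ch=2048, groups=4, parts=4, out_dim=512):
        super().__init__()
        self.groups, self.parts = groups, parts
        gc = in_ch // groups                                     # 512 channels per group
        self.local_convs = nn.ModuleList(                        # one conv per (group, stripe)
            [nn.Conv2d(gc, out_dim, kernel_size=(4, 8)) for _ in range(groups * parts)])
        self.global_conv = nn.Conv2d(in_ch, out_dim, kernel_size=(16, 8))

    def forward(self, T):                                        # T: (B, 2048, 16, 8)
        g = self.global_conv(T).flatten(1)                       # global feature g, (B, 512)
        pcs = []
        for gi, Tg in enumerate(T.chunk(self.groups, dim=1)):    # 4 channel groups
            for pi, Tgp in enumerate(Tg.chunk(self.parts, dim=2)):   # 4 horizontal stripes
                pcs.append(self.local_convs[gi * self.parts + pi](Tgp).flatten(1))
        return g, pcs                                            # g plus pc1~pc16, each (B, 512)
```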
Step 7, applying different loss functions to different characteristics:
For the local and channel combined features pc1~pc16, the Batch Hard Triplet Loss is applied separately:

$$L_{BH}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(f_\theta(x_a^i),f_\theta(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(f_\theta(x_a^i),f_\theta(x_n^j)\big)\Big]_{+}\qquad(1)$$

In formula (1), X denotes the mini-batch sampled in step 1 and θ denotes the parameters of the network. $x_a^i$ denotes the a-th of the K pictures of the i-th pedestrian and $x_p^i$ denotes the p-th of the K pictures of the i-th pedestrian; because the two pictures belong to the same pedestrian, they are called a positive sample pair. $x_n^j$ denotes the n-th of the K pictures of the j-th pedestrian; because $x_a^i$ and $x_n^j$ belong to different pedestrians, they are called a negative sample pair. $f_\theta(x)$ denotes the feature that picture x yields after passing through the network, and $D(x,y)$ denotes the Euclidean distance between features x and y. m is a constant constraining the relationship between the distances of the two feature pairs, and $[x]_+=\max(0,x)$. For a pedestrian picture $x_a^i$, the loss traverses each of that pedestrian's K pictures $x_p^i$ to find the particular $x_p^i$ whose feature is at the largest Euclidean distance from the feature of $x_a^i$ after the network; $(x_a^i,x_p^i)$ is the hard positive sample pair. It also traverses every picture $x_n^j$ of the other pedestrians ((P-1) × K pictures in total) to find the particular $x_n^j$ whose feature is at the smallest Euclidean distance from the feature of $x_a^i$ after the network; $(x_a^i,x_n^j)$ is the hard negative sample pair. The loss function thus finds the hard positive and hard negative sample pairs corresponding to each picture of each pedestrian and constrains the relationship between the hard positive pair feature distance and the hard negative pair feature distance.
For the feature pc1, the Batch Hard Triplet Loss is:

$$L_{BH}^{pc_1}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(pc_1(x_a^i),pc_1(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(pc_1(x_a^i),pc_1(x_n^j)\big)\Big]_{+}\qquad(2)$$

In formula (2), $pc_1(x_a^i)$ denotes the feature pc1 extracted from the a-th picture of the i-th pedestrian, $pc_1(x_p^i)$ denotes the feature pc1 extracted from the p-th picture of the i-th pedestrian, and $pc_1(x_n^j)$ denotes the feature pc1 extracted from the n-th picture of the j-th pedestrian.
For the global feature g, the Batch Hard Triplet Loss and the Softmax Loss are applied respectively. The Batch Hard Triplet Loss is:

$$L_{BH}^{g}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(g(x_a^i),g(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(g(x_a^i),g(x_n^j)\big)\Big]_{+}\qquad(3)$$

In formula (3), $g(x_a^i)$ denotes the feature g extracted from the a-th picture of the i-th pedestrian, $g(x_p^i)$ denotes the feature g extracted from the p-th picture of the i-th pedestrian, and $g(x_n^j)$ denotes the feature g extracted from the n-th picture of the j-th pedestrian. Before applying the Softmax Loss, g must be fed into a Fully Connected Layer (FC layer). The number of output neurons of the FC layer equals the total number C of pedestrian identities in the training set, and the parameters of the FC layer are randomly initialized. The Softmax Loss of the global feature g is:

$$L_{Softmax}(\theta;X)=-\sum_{i=1}^{P}\sum_{a=1}^{K}\log\frac{\exp\big(W_{y(x_a^i)}^{\top}\,g(x_a^i)\big)}{\sum_{k=1}^{C}\exp\big(W_k^{\top}\,g(x_a^i)\big)}\qquad(4)$$

In formula (4), $g(x_a^i)$ denotes the feature g extracted from the a-th picture of the i-th pedestrian, $y(x_a^i)$ denotes the pedestrian identity corresponding to that picture, $W_{y(x_a^i)}$ denotes the weight of the FC layer output neuron corresponding to that identity, and $W_k$ denotes the weight of the k-th output neuron of the FC layer.

The overall loss function of the network is:

$$Loss=\lambda_1 L_{BH}^{g}+\lambda_2 L_{Softmax}+\lambda_3\sum_{t=1}^{16}L_{BH}^{pc_t}\qquad(5)$$

In formula (5), λ1, λ2, λ3 are the weights of the three losses and satisfy λ1 + λ2 + λ3 = 1.
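A compact PyTorch sketch of the Batch Hard Triplet Loss of formula (1) follows; the margin value is an assumed hyperparameter, and the embeddings are assumed to arrive as one (P·K) × D matrix with one identity label per row. The same function can be applied to g and to each pc feature, while the Softmax Loss corresponds to a fully connected layer followed by cross-entropy.

```python
import torch

def batch_hard_triplet_loss(features, pids, margin=0.3):
    """Batch Hard triplet loss: for every anchor, take the hardest positive
    (largest distance, same identity) and hardest negative (smallest distance,
    different identity), then apply the margin hinge of formula (1)."""
    dist = torch.cdist(features, features)             # pairwise Euclidean distances, (N, N)
    same_id = pids.unsqueeze(0) == pids.unsqueeze(1)   # (N, N) boolean identity mask
    hardest_pos = (dist * same_id.float()).max(dim=1).values                  # hardest positive
    hardest_neg = dist.masked_fill(same_id, float('inf')).min(dim=1).values   # hardest negative
    return torch.clamp(margin + hardest_pos - hardest_neg, min=0).sum()       # [.]_+ and sum
```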
Step 8, denoting the network constructed in steps 3-6 as N; using a gradient descent algorithm, the loss function Loss in step 7 is differentiated and the learnable parameters in N are optimized through back propagation.
Step 9, aligning the feature map using a spatial transformer network:
9-1. The feature map F4 (a three-dimensional tensor) output by the 4th block (Res4 Block) of the convolution base network in N is passed through a residual connection block (Res Block, parameters randomly initialized) and a GAP layer to obtain a vector θ = (θ11, θ12, θ13, θ21, θ22, θ23) of length 6. θ11, θ12, θ21, θ22 are used to scale and rotate the feature map, and θ13, θ23 are used to translate it.
9-2. Using θ11, θ12, θ13, θ21, θ22, θ23, the feature map F2 (a tensor of size H × W × C) output by the 2nd block (Res2 Block) of the convolution base network in N is affine transformed to obtain a blank feature map F″2. For the feature map of channel c of F2 (an H × W tensor), a pixel at coordinates $(x_s, y_s)$ is mapped by the affine transformation to $(x_t, y_t)$; the relationship between the two is:

$$\begin{pmatrix}x_t\\ y_t\end{pmatrix}=\begin{pmatrix}\theta_{11} & \theta_{12} & \theta_{13}\\ \theta_{21} & \theta_{22} & \theta_{23}\end{pmatrix}\begin{pmatrix}x_s\\ y_s\\ 1\end{pmatrix}\qquad(6)$$

9-3. According to formula (6), the blank feature map F″2 is filled with pixels sampled from F2, yielding the aligned feature map F″2. During the affine transformation, when the F2 coordinate corresponding to a coordinate in F″2 falls outside the original range of F2, the pixel value at that coordinate is set to 0. When the F2 coordinate corresponding to a coordinate in F″2 is not an integer pixel location, the pixel value is filled by bilinear interpolation:

$$F_2''^{\,c}(m,n)=\sum_{h=1}^{H}\sum_{w=1}^{W}F_2^{\,c}(h,w)\,\max\big(0,1-|x_s-w|\big)\,\max\big(0,1-|y_s-h|\big)\qquad(7)$$

In formula (7), $F_2''^{\,c}(m,n)$ is the pixel value at position (m, n) on channel c of F″2, $F_2^{\,c}(h,w)$ is the pixel value at position (h, w) on channel c of F2, and $(x_s, y_s)$ is the F2 coordinate corresponding to position (m, n) of F″2.
Step 10, processing the aligned feature map:
The aligned feature map F″2 is input into a new convolutional network, which is formed by stacking the Res3 Block, Res4 Block and Res5 Block of a ResNet-50 network pre-trained on the ImageNet dataset, and which outputs a feature map T_align with the same size as the feature map T in step 3. The same operations as in steps 3-6 are applied to T_align, yielding 1 global feature g_align and 16 local and channel combined features pc1_align~pc16_align. The network constructed in steps 9-10 is denoted N_align; N_align consists of the Res1 Block, Res2 Block, Res3 Block, Res4 Block and STN of the convolution base network in N, the Res3 Block, Res4 Block and Res5 Block of the new convolutional network, and the convolutional layers that compress the global feature and the local and channel combined features. For the global feature g_align and the local and channel combined features pc1_align~pc16_align, the same loss functions as in step 7 are used to optimize the learnable parameters of N_align.
II, a test flow:
The test dataset is divided into a query set and a warehouse set (i.e. a gallery). The query set contains pedestrian pictures with known identities, and the warehouse set contains pictures with the same identities as the pedestrians in the query set as well as pictures with different identities. The dataset is constructed by capturing pedestrian pictures with surveillance cameras whose views do not overlap, automatically marking pedestrian bounding boxes with a pedestrian detector (DPM), and finally keeping the pedestrian pictures inside the bounding boxes and adding pedestrian identity labels; pictures of the same pedestrian in the query set and the warehouse set have different shooting viewpoints. The specific steps are as follows:
Step 1, inputting the pedestrian picture to be queried into N_align and concatenating the outputs g_align and pc1_align~pc16_align to obtain the descriptor of the pedestrian, which is an 8704-dimensional feature vector.
Step 2, similarly obtaining the pedestrian descriptors of all pictures in the warehouse set through step 1.
Step 3, calculating and storing the cosine distances between the descriptor of the pedestrian to be queried and each pedestrian descriptor in the warehouse set.
Step 4, sorting the stored distances from small to large and selecting the warehouse pedestrian pictures corresponding to the smallest k distances as the re-identification result for the pedestrian to be queried.
Step 5, measuring the recognition performance of the model by judging whether the true identities of the re-identified warehouse pedestrian pictures are consistent with the identity of the pedestrian to be queried.
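The retrieval of steps 3-4 can be sketched in a few lines of NumPy; the function name and the value of k are illustrative.

```python
import numpy as np

def rank_gallery(query_desc, gallery_descs, k=10):
    """Rank warehouse (gallery) descriptors by cosine distance to the query descriptor
    and return the indices and distances of the k closest ones. Descriptors are the
    8704-d vectors formed by concatenating g_align and pc1_align~pc16_align."""
    q = query_desc / np.linalg.norm(query_desc)
    g = gallery_descs / np.linalg.norm(gallery_descs, axis=1, keepdims=True)
    cos_dist = 1.0 - g @ q                  # cosine distance to every warehouse descriptor
    order = np.argsort(cos_dist)            # ascending: smallest distance = best match
    return order[:k], cos_dist[order[:k]]
```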
The invention has the following beneficial effects:
The invention simulates various occlusion situations through data augmentation; by processing artificially occluded pictures, the network improves its robustness to the occlusion problem. The STN simultaneously scales, rotates and translates the picture to align the pedestrian picture. Once the pictures are aligned, simply splitting them horizontally locates the different body parts of the pedestrian well and yields features of the different parts (cutting, channel grouping and affine transformation applied to the feature map are equivalent to the same operations applied to the original picture). Different channels of the feature map respond to different patterns (color, clothing type, gender, age, etc.), so the local and channel combined features can better locate the different patterns on different body parts of a pedestrian. For the global feature of the whole pedestrian picture, a classification loss correctly classifies the pedestrian identity, while a similarity loss draws the features of the same pedestrian closer together and pushes the features of different pedestrians further apart. For the local and channel combined features, a classification loss is not suitable because they carry little information and cannot correctly classify the pedestrian identity on their own; instead, comparing the different patterns on the different body parts through a similarity loss lets the model distinguish these patterns better and makes the local and channel combined features more discriminative. Finally, the two kinds of features are fused as the pedestrian descriptor, further improving its discriminative power. By improving the occlusion resistance and discriminative power of the pedestrian descriptor, pedestrian re-identification becomes more accurate.
Drawings
FIG. 1 is a flow chart of the training of the present invention;
FIG. 2 is an exemplary diagram of data enhancement;
FIG. 3 is a network constructed by training steps 3-6;
FIG. 4 is a network constructed by training steps 9-10;
FIG. 5 is a flow chart of the test of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The training flow of the pedestrian re-identification method based on local and channel combined features is shown in FIG. 1. Data augmentation is first applied to the batch of training samples; the augmented samples are input into the convolution base network, which outputs a feature map. Two different operations are applied to the feature map: the first compresses the feature map to obtain the global feature; the second groups the channels and cuts horizontally to generate sub-feature maps, which are then compressed to obtain the local and channel combined features. Different loss functions are applied to the global feature and the local and channel combined features, the total loss function is differentiated, and the network is optimized with the back propagation algorithm. The feature map output by the Res2 Block of the optimized network is aligned by the STN, and the aligned feature map is input into a new convolutional network to obtain an output feature map. The aligned global feature and local and channel combined features are obtained from this feature map in the same way as above, and the same loss functions are applied to optimize the new network again.
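Putting the pieces together, one optimization step of network N could look like the following sketch; it reuses the hypothetical helpers sketched earlier (build_conv_base, LocalChannelHead, batch_hard_triplet_loss), and the loss weights λ1, λ2, λ3, the margin and the classifier layer are assumptions, not values prescribed by the patent.

```python
import torch
import torch.nn.functional as F

def training_step(images, pids, base, head, classifier, optimizer,
                  lambdas=(0.4, 0.3, 0.3), margin=0.3):
    """One step of optimizing network N: forward pass, weighted loss of formula (5),
    back propagation and parameter update. `classifier` is an nn.Linear(512, C);
    batch_hard_triplet_loss is the function from the earlier sketch."""
    T = base(images)                           # (B, 2048, 16, 8) feature map from the conv base
    g, pcs = head(T)                           # global feature + 16 local-channel features
    l1, l2, l3 = lambdas
    loss = l1 * batch_hard_triplet_loss(g, pids, margin)            # triplet loss on g
    loss = loss + l2 * F.cross_entropy(classifier(g), pids)         # softmax loss on g
    loss = loss + l3 * sum(batch_hard_triplet_loss(pc, pids, margin) for pc in pcs)
    optimizer.zero_grad()
    loss.backward()                            # back-propagate and update the parameters of N
    optimizer.step()
    return loss.item()
```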
The method comprises the following specific steps:
step 1, sampling samples in a training set to generate small-batch data:
A mini-batch contains P × K pictures, namely P pedestrians with different identities and K pictures of each pedestrian. If a pedestrian has more than K pictures in the training set, K pictures are randomly sampled; if fewer than K, all pictures are sampled and the remainder is filled by repeated sampling.
Step 2, improving the anti-occlusion capability of the model through the data augmentation shown in FIG. 2:
2-1, generating a picture pool (Pool) capable of storing picture blocks of different resolutions;
2-2, before each picture is input into the network, copying, with probability p1, one small picture from it into the Pool; assuming the resolution of the picture is H × W, the resolution of the small picture, i.e. the picture block, falls randomly in the interval [0.1H, 0.2H] × [0.1W, 0.2W], and its location is also randomly selected;
2-3, then, with probability p2, randomly selecting a picture block from the Pool to cover the picture, the covered position being randomly selected.
Step 3, loading a pre-training network:
The ResNet-50 network pre-trained on the ImageNet dataset is used; the structure before its Global Average Pooling (GAP) layer is preserved, and the stride of the last convolutional layer is set to 1. The result is denoted the convolution base network. A picture with a resolution of 256 × 128 is input into the convolution base network, which outputs a tensor feature map T of size 16 × 8 × 2048.
Step 4, grouping the channels to obtain the features of each group of channels:
The tensor feature map T of size 16 × 8 × 2048 obtained in step 3 is divided equally into 4 groups along the channel dimension. Each group is a tensor feature map of size 16 × 8 × 512, denoted T1, T2, T3, T4.
Step 5, cutting the tensor feature map to obtain local features:
Each group tensor feature map T1, T2, T3, T4 obtained in step 4 is divided equally into 4 local tensor feature maps along the horizontal direction. Each local tensor feature map has size 4 × 8 × 512; they are denoted T11~T14, T21~T24, T31~T34, T41~T44. Steps 4 and 5 thus yield 16 local tensor feature maps T11~T14, T21~T24, T31~T34, T41~T44. Each local tensor feature map represents the combined features of a particular location and a particular group of channels.
Step 6, compressing the feature map:
The tensor feature map T is convolved with 512 convolution kernels of size 16 × 8 × 512 (parameters randomly initialized), yielding a global feature g of size 1 × 1 × 512. Similarly, T11~T14, T21~T24, T31~T34, T41~T44 are each convolved, the convolution kernel size corresponding to each local tensor feature map being 4 × 8 × 512 and the number being 512 (parameters randomly initialized), yielding 16 local and channel combined features pc1~pc16, each of size 1 × 1 × 512. The network N constructed in steps 3-6 is shown in FIG. 3.
Step 7, applying different loss functions to different characteristics:
For the local and channel combined features pc1~pc16, the Batch Hard Triplet Loss is applied separately.
For example, for the feature pc1 the Batch Hard Triplet Loss is:

$$L_{BH}^{pc_1}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(pc_1(x_a^i),pc_1(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(pc_1(x_a^i),pc_1(x_n^j)\big)\Big]_{+}\qquad(8)$$

For the global feature g, the Batch Hard Triplet Loss and the Softmax Loss are applied respectively. The Batch Hard Triplet Loss is:

$$L_{BH}^{g}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(g(x_a^i),g(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(g(x_a^i),g(x_n^j)\big)\Big]_{+}\qquad(9)$$

The Softmax Loss is:

$$L_{Softmax}(\theta;X)=-\sum_{i=1}^{P}\sum_{a=1}^{K}\log\frac{\exp\big(W_{y(x_a^i)}^{\top}\,g(x_a^i)\big)}{\sum_{k=1}^{C}\exp\big(W_k^{\top}\,g(x_a^i)\big)}\qquad(10)$$

The overall loss function of the network is:

$$Loss=\lambda_1 L_{BH}^{g}+\lambda_2 L_{Softmax}+\lambda_3\sum_{t=1}^{16}L_{BH}^{pc_t}\qquad(11)$$
Step 8, using a gradient descent algorithm to differentiate the loss function Loss in step 7 and back-propagate it to optimize the learnable parameters in N.
Step 9, aligning the feature map using a spatial transformer network:
9-1. The feature map F4 (a three-dimensional tensor) output by the 4th block (Res4 Block) of the convolution base network in N is passed through a residual connection block (Res Block, parameters randomly initialized) and a GAP layer to obtain a vector θ = (θ11, θ12, θ13, θ21, θ22, θ23) of length 6. θ11, θ12, θ21, θ22 are used to scale and rotate the feature map, and θ13, θ23 are used to translate it.
9-2. Using θ11, θ12, θ13, θ21, θ22, θ23, the feature map F2 (a tensor of size H × W × C) output by the 2nd block (Res2 Block) of the convolution base network in N is affine transformed to obtain a blank feature map F″2. For the feature map of channel c of F2 (an H × W tensor), a pixel at coordinates $(x_s, y_s)$ is mapped by the affine transformation to $(x_t, y_t)$; the relationship between the two is:

$$\begin{pmatrix}x_t\\ y_t\end{pmatrix}=\begin{pmatrix}\theta_{11} & \theta_{12} & \theta_{13}\\ \theta_{21} & \theta_{22} & \theta_{23}\end{pmatrix}\begin{pmatrix}x_s\\ y_s\\ 1\end{pmatrix}\qquad(12)$$

9-3. According to formula (12), the blank feature map F″2 is filled with pixels sampled from F2, yielding the aligned feature map F″2. During the affine transformation, when the F2 coordinate corresponding to a coordinate in F″2 falls outside the original range of F2, the pixel value at that coordinate is set to 0. When the F2 coordinate corresponding to a coordinate in F″2 is not an integer pixel location, the pixel value is filled by bilinear interpolation:

$$F_2''^{\,c}(m,n)=\sum_{h=1}^{H}\sum_{w=1}^{W}F_2^{\,c}(h,w)\,\max\big(0,1-|x_s-w|\big)\,\max\big(0,1-|y_s-h|\big)\qquad(13)$$
Step 10, processing the aligned feature map:
The aligned feature map F″2 is input into a new convolutional network, which is formed by stacking the Res3 Block, Res4 Block and Res5 Block of a ResNet-50 network pre-trained on the ImageNet dataset, and which outputs a feature map T_align with the same size as the feature map T in step 3. The same operations as in steps 3-6 are applied to T_align, yielding 1 global feature g_align and 16 local and channel combined features pc1_align~pc16_align. The network constructed in steps 9-10 is denoted N_align; N_align consists of the Res1 Block, Res2 Block, Res3 Block, Res4 Block and STN of the convolution base network in N, the Res3 Block, Res4 Block and Res5 Block of the new convolutional network, and the convolutional layers that compress the global feature and the local and channel combined features; its specific structure is shown in FIG. 4. For the global feature g_align and the local and channel combined features pc1_align~pc16_align, the same loss functions as in step 7 are used to optimize the learnable parameters of N_align.
The test flow of the pedestrian re-identification method based on local and channel combined features is shown in FIG. 5. The pedestrian picture to be queried and all pedestrian pictures in the warehouse set are input into the trained network, which outputs their pedestrian descriptors. The cosine distances between the pedestrian descriptors are calculated, and the k warehouse pedestrian pictures with the smallest distances are selected as the re-identification result for the pedestrian picture to be queried. The quality of the model is judged by comparing whether the identities of the re-identified pedestrians are consistent with the identity of the pedestrian to be queried.
The method comprises the following specific steps:
step 1, inputting a pedestrian picture to be inquired into NalignG to be outputalignAnd
Figure BDA0002510875020000114
connected to obtain the descriptor of the pedestrian
Figure BDA0002510875020000115
Is a 8704 dimensional feature vector.
Step 2, similarly obtaining the pedestrian descriptors of all pictures in the warehouse set through step 1.
Step 3, calculating and storing the cosine distances between the descriptor of the pedestrian to be queried and each pedestrian descriptor in the warehouse set.
Step 4, sorting the stored distances from small to large and selecting the warehouse pedestrian pictures corresponding to the smallest k distances as the re-identification result for the pedestrian to be queried.

Claims (3)

1. A pedestrian re-identification method based on local and channel combination features is characterized by comprising the following procedures:
firstly, a training process: training a neural network to obtain optimal network parameters; a sample in the training data set consists of a pedestrian picture x and a corresponding pedestrian identity ID(x), where ID(x) ∈ {1, ..., C}; C represents the total number of pedestrian identities, and a pedestrian of one identity has several pictures;
II, a test flow:
the test data set is divided into an inquiry set and a warehouse set, wherein the inquiry set comprises the pedestrian pictures with known identities, and the warehouse set comprises the pictures with the same identities as the pedestrians in the inquiry set and the pictures with different identities from the pedestrians in the inquiry set; the data set is constructed by shooting pictures of pedestrians by monitoring cameras with non-overlapping view angles, automatically marking a rectangular frame of the pedestrians by a pedestrian Detector (DPM), and finally keeping the pictures of the pedestrians in the rectangular frame and adding identity tags of the pedestrians, wherein the pictures of the same pedestrian in the query set and the warehouse set have different shooting view angles.
2. The pedestrian re-identification method based on the local and channel combined features as claimed in claim 1, wherein the training process comprises the following steps:
step 1, sampling samples in a training set to generate small-batch data:
the small batch of data comprises P multiplied by K pictures, namely P pedestrians with different identities, wherein each pedestrian has K pictures; if the number of pictures of one pedestrian is more than K in the training set, randomly sampling K pictures; if the number of the pictures is less than K, sampling all the pictures, and repeatedly sampling when the number of the pictures is insufficient;
step 2, improving the anti-occlusion capability of the model through data augmentation:
2-1, generating a picture pool (Pool) capable of storing picture blocks of different resolutions;
2-2, before each picture is input into the network, copying, with probability p1, one small picture from it into the Pool; assuming the resolution of the picture is H × W, the resolution of the small picture, i.e. the picture block, falls randomly in the interval [0.1H, 0.2H] × [0.1W, 0.2W], and its location is also randomly selected;
2-3, then, with probability p2, randomly selecting a picture block from the Pool to cover the picture, the covered position being randomly selected;
step 3, loading a pre-training network:
using the ResNet-50 network pre-trained on the ImageNet dataset, preserving the structure before the Global Average Pooling (GAP) layer of this network, and setting the stride of the last convolutional layer to 1, the result being denoted the "convolution base network"; inputting a picture with a resolution of 256 × 128 into the convolution base network and outputting a tensor feature map T of size 16 × 8 × 2048;
step 4, grouping the channels to obtain the features of each group of channels:
dividing the tensor feature map T of size 16 × 8 × 2048 obtained in step 3 equally into 4 groups along the channel (namely the last) dimension, the tensor feature map of each group having size 16 × 8 × 512 and being respectively denoted T1, T2, T3, T4;
step 5, cutting the tensor feature map to obtain local features:
dividing each group tensor feature map T1, T2, T3, T4 obtained in step 4 equally into 4 local tensor feature maps along the horizontal direction, each local tensor feature map having size 4 × 8 × 512 and being respectively denoted T11~T14, T21~T24, T31~T34, T41~T44; obtaining 16 local tensor feature maps T11~T14, T21~T24, T31~T34, T41~T44 through steps 4 and 5; each local tensor feature map representing the combined features of a particular position and a particular group of channels;
step 6, compressing the feature map:
convolving the tensor feature map T with 512 convolution kernels of size 16 × 8 × 512, the parameters being randomly initialized, to obtain a global feature g of size 1 × 1 × 512; similarly, convolving T11~T14, T21~T24, T31~T34, T41~T44 respectively, the convolution kernel size corresponding to each local tensor feature map being 4 × 8 × 512, the number being 512 and the parameters being randomly initialized, to obtain 16 local and channel combined features pc1~pc16 of size 1 × 1 × 512;
Step 7, applying different loss functions to different characteristics:
for the local and channel combined features pc1~pc16, applying the Batch Hard Triplet Loss separately:

$$L_{BH}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(f_\theta(x_a^i),f_\theta(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(f_\theta(x_a^i),f_\theta(x_n^j)\big)\Big]_{+}\qquad(1)$$

in formula (1), X represents the mini-batch of data sampled in step 1, and θ represents the parameters of the network; $x_a^i$ represents the a-th of the K pictures of the i-th pedestrian, and $x_p^i$ represents the p-th of the K pictures of the i-th pedestrian, the two pictures belonging to the same pedestrian and being called a positive sample pair; $x_n^j$ represents the n-th of the K pictures of the j-th pedestrian; because $x_a^i$ and $x_n^j$ belong to different pedestrians, they are called a negative sample pair; $f_\theta(x)$ represents the feature output after picture x passes through the network, and $D(x,y)$ represents the Euclidean distance between features x and y; m is a constant that constrains the relationship between the distances of the two feature pairs, and $[x]_+=\max(0,x)$; for a pedestrian picture $x_a^i$, each of that pedestrian's K pictures $x_p^i$ is traversed to find the particular $x_p^i$ such that the Euclidean distance between the features of $x_a^i$ and $x_p^i$ obtained after the network is largest, $(x_a^i,x_p^i)$ being the hard positive sample pair; at the same time, each picture $x_n^j$ of the other pedestrians ((P-1) × K pictures in total) is traversed to find the particular $x_n^j$ such that the Euclidean distance between the features of $x_a^i$ and $x_n^j$ obtained after the network is smallest, $(x_a^i,x_n^j)$ being the hard negative sample pair; the loss function finds the hard positive and hard negative sample pairs corresponding to each picture of each pedestrian and constrains the relationship between the hard positive pair feature distance and the hard negative pair feature distance;
for the feature pc1, the Batch Hard Triplet Loss is:

$$L_{BH}^{pc_1}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(pc_1(x_a^i),pc_1(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(pc_1(x_a^i),pc_1(x_n^j)\big)\Big]_{+}\qquad(2)$$

in formula (2), $pc_1(x_a^i)$ represents the feature pc1 extracted from the a-th picture of the i-th pedestrian, $pc_1(x_p^i)$ represents the feature pc1 extracted from the p-th picture of the i-th pedestrian, and $pc_1(x_n^j)$ represents the feature pc1 extracted from the n-th picture of the j-th pedestrian;
for the global feature g, applying the Batch Hard Triplet Loss and the Softmax Loss respectively; the Batch Hard Triplet Loss is:

$$L_{BH}^{g}(\theta;X)=\sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m+\max_{p=1,\dots,K}D\big(g(x_a^i),g(x_p^i)\big)-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}D\big(g(x_a^i),g(x_n^j)\big)\Big]_{+}\qquad(3)$$

in formula (3), $g(x_a^i)$ represents the feature g extracted from the a-th picture of the i-th pedestrian, $g(x_p^i)$ represents the feature g extracted from the p-th picture of the i-th pedestrian, and $g(x_n^j)$ represents the feature g extracted from the n-th picture of the j-th pedestrian; before applying the Softmax Loss, g needs to be input into a Fully Connected Layer (FC layer); the number of output neurons of the fully connected layer is the total number C of pedestrian identities in the training set, and the parameters of the fully connected layer are randomly initialized; the Softmax Loss of the global feature g is:

$$L_{Softmax}(\theta;X)=-\sum_{i=1}^{P}\sum_{a=1}^{K}\log\frac{\exp\big(W_{y(x_a^i)}^{\top}\,g(x_a^i)\big)}{\sum_{k=1}^{C}\exp\big(W_k^{\top}\,g(x_a^i)\big)}\qquad(4)$$

in formula (4), $g(x_a^i)$ represents the feature g extracted from the a-th picture of the i-th pedestrian, $y(x_a^i)$ represents the pedestrian identity corresponding to that picture, $W_{y(x_a^i)}$ represents the weight of the FC layer output neuron corresponding to that identity, and $W_k$ represents the weight corresponding to the k-th output neuron of the FC layer;

the overall loss function of the network is:

$$Loss=\lambda_1 L_{BH}^{g}+\lambda_2 L_{Softmax}+\lambda_3\sum_{t=1}^{16}L_{BH}^{pc_t}\qquad(5)$$

in formula (5), λ1, λ2, λ3 are the weights of the three losses and satisfy λ1 + λ2 + λ3 = 1;
step 8, denoting the network constructed in steps 3-6 as N; using a gradient descent algorithm, differentiating the loss function Loss in step 7 and optimizing the learnable parameters in N through back propagation;
step 9, aligning the feature map using a spatial transformer network:
9-1, passing the feature map F4 (a three-dimensional tensor) output by the 4th block (Res4 Block) of the convolution base network in N through a residual connection block (Res Block, parameters randomly initialized) and a GAP layer to obtain a vector θ = (θ11, θ12, θ13, θ21, θ22, θ23) of length 6; wherein θ11, θ12, θ21, θ22 are used for scaling and rotating the feature map, and θ13, θ23 are used for translating the feature map;
9-2, using θ11, θ12, θ13, θ21, θ22, θ23 to affine transform the feature map F2 (a tensor of size H × W × C) output by the 2nd block (Res2 Block) of the convolution base network in N, obtaining a blank feature map F″2; for the feature map of channel c of F2 (an H × W tensor), a pixel at coordinates $(x_s, y_s)$ becomes $(x_t, y_t)$ after the affine transformation, the relationship between the two being:

$$\begin{pmatrix}x_t\\ y_t\end{pmatrix}=\begin{pmatrix}\theta_{11} & \theta_{12} & \theta_{13}\\ \theta_{21} & \theta_{22} & \theta_{23}\end{pmatrix}\begin{pmatrix}x_s\\ y_s\\ 1\end{pmatrix}\qquad(6)$$

9-3, according to formula (6), filling the blank feature map F″2 with pixels sampled from F2 to obtain the aligned feature map F″2; during the affine transformation, when the F2 coordinate corresponding to a coordinate in F″2 exceeds the original range of F2, setting the pixel value at that coordinate to 0; when the F2 coordinate corresponding to a coordinate in F″2 is not an integer pixel location, filling the pixel value by bilinear interpolation:

$$F_2''^{\,c}(m,n)=\sum_{h=1}^{H}\sum_{w=1}^{W}F_2^{\,c}(h,w)\,\max\big(0,1-|x_s-w|\big)\,\max\big(0,1-|y_s-h|\big)\qquad(7)$$

in formula (7), $F_2''^{\,c}(m,n)$ is the pixel value at position (m, n) on channel c of F″2, $F_2^{\,c}(h,w)$ is the pixel value at position (h, w) on channel c of F2, and $(x_s, y_s)$ is the F2 coordinate corresponding to position (m, n) of F″2;
step 10, processing the aligned feature map:
inputting the aligned feature map F″2 into a new convolutional network, the new convolutional network being formed by stacking the Res3 Block, Res4 Block and Res5 Block of a ResNet-50 network pre-trained on the ImageNet dataset, and outputting a feature map T_align with the same size as the feature map T in step 3; performing the same operations as in steps 3-6 on T_align to obtain 1 global feature g_align and 16 local and channel combined features pc1_align~pc16_align; denoting the network constructed in steps 9-10 as N_align, N_align being composed of the Res1 Block, Res2 Block, Res3 Block, Res4 Block and STN of the convolution base network in N, the Res3 Block, Res4 Block and Res5 Block of the new convolutional network, and the convolutional layers for compressing the global feature and the local and channel combined features; for the global feature g_align and the local and channel combined features pc1_align~pc16_align, optimizing the learnable parameters of N_align using the same loss functions as in step 7.
3. The pedestrian re-identification method based on the local and channel combined features as claimed in claim 2, wherein the testing process comprises the following steps:
step 1, inputting the pedestrian picture to be queried into N_align and concatenating the outputs g_align and pc1_align~pc16_align to obtain the descriptor of the pedestrian, which is an 8704-dimensional feature vector;
step 2, obtaining pedestrian descriptors of all the pictures in the warehouse set through the step 1;
step 3, respectively calculating and storing cosine distances between the pedestrian descriptors to be inquired and each pedestrian descriptor in the warehouse set;
step 4, sorting the stored distances from small to large and selecting the warehouse pedestrian pictures corresponding to the smallest k distances as the re-identification result for the pedestrian to be queried;
step 5, measuring the recognition performance of the model by judging whether the true identities of the re-identified warehouse pedestrian pictures are consistent with the identity of the pedestrian to be queried.
CN202010460902.9A 2020-05-27 2020-05-27 Pedestrian re-identification method based on local and channel combination characteristics Active CN111709313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010460902.9A CN111709313B (en) 2020-05-27 2020-05-27 Pedestrian re-identification method based on local and channel combination characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010460902.9A CN111709313B (en) 2020-05-27 2020-05-27 Pedestrian re-identification method based on local and channel combination characteristics

Publications (2)

Publication Number Publication Date
CN111709313A true CN111709313A (en) 2020-09-25
CN111709313B CN111709313B (en) 2022-07-29

Family

ID=72537979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010460902.9A Active CN111709313B (en) 2020-05-27 2020-05-27 Pedestrian re-identification method based on local and channel combination characteristics

Country Status (1)

Country Link
CN (1) CN111709313B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009528A (en) * 2017-12-26 2018-05-08 广州广电运通金融电子股份有限公司 Face authentication method, device, computer equipment and storage medium based on Triplet Loss
CN108399362A (en) * 2018-01-24 2018-08-14 中山大学 A kind of rapid pedestrian detection method and device
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109583379A (en) * 2018-11-30 2019-04-05 常州大学 A kind of pedestrian's recognition methods again being aligned network based on selective erasing pedestrian
CN110543817A (en) * 2019-07-25 2019-12-06 北京大学 Pedestrian re-identification method based on posture guidance feature learning
CN110659573A (en) * 2019-08-22 2020-01-07 北京捷通华声科技股份有限公司 Face recognition method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEILIN ZHONG 等: "Combining multilevel feature extraction and multi-loss learning for person re-identification", 《NEUROCOMPUTING》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112232194A (en) * 2020-10-15 2021-01-15 广州云从凯风科技有限公司 Single-target human body key point detection method, system, equipment and medium
CN112686176A (en) * 2020-12-30 2021-04-20 深圳云天励飞技术股份有限公司 Target re-recognition method, model training method, device, equipment and storage medium
CN112686176B (en) * 2020-12-30 2024-05-07 深圳云天励飞技术股份有限公司 Target re-identification method, model training method, device, equipment and storage medium
CN113343909A (en) * 2021-06-29 2021-09-03 南京星云数字技术有限公司 Training method of multi-task classification network and pedestrian re-identification method
CN113343909B (en) * 2021-06-29 2023-09-26 南京星云数字技术有限公司 Training method of multi-task classification network and pedestrian re-recognition method
CN113255615A (en) * 2021-07-06 2021-08-13 南京视察者智能科技有限公司 Pedestrian retrieval method and device for self-supervision learning
CN114170516A (en) * 2021-12-09 2022-03-11 清华大学 Vehicle weight recognition method and device based on roadside perception and electronic equipment
CN114299535A (en) * 2021-12-09 2022-04-08 河北大学 Feature aggregation human body posture estimation method based on Transformer
CN114170516B (en) * 2021-12-09 2022-09-13 清华大学 Vehicle weight recognition method and device based on roadside perception and electronic equipment
CN114299535B (en) * 2021-12-09 2024-05-31 河北大学 Transformer-based feature aggregation human body posture estimation method

Also Published As

Publication number Publication date
CN111709313B (en) 2022-07-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant