CN113158891B - Cross-camera pedestrian re-identification method based on global feature matching - Google Patents
- Publication number
- CN113158891B CN113158891B CN202110423474.7A CN202110423474A CN113158891B CN 113158891 B CN113158891 B CN 113158891B CN 202110423474 A CN202110423474 A CN 202110423474A CN 113158891 B CN113158891 B CN 113158891B
- Authority
- CN
- China
- Prior art keywords
- pedestrian
- detection
- feature
- frame
- camera
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The invention discloses a cross-camera pedestrian re-identification method based on global feature matching, aimed at the problem that face tracking fails when a target moves across cameras in public places. First, a small number of manually labeled samples are constructed for different scenes and used to train deep learning network models. During subsequent matching, labels are generated and collected automatically, producing a large number of samples that are not manually labeled; after manual screening, the screened data set is used to further train the detection model, improving later detection accuracy. The invention effectively improves the deployment efficiency of re-identification projects, provides concrete steps for deploying re-identification detection, and greatly reduces the workload of annotation staff.
Description
Technical Field
The invention relates to the technical field of image recognition and computer vision, and in particular to a cross-camera pedestrian re-identification method based on global feature matching.
Background
Pedestrian re-identification (Re-ID) is one of the key tasks in computer vision. It is combined with pedestrian detection and pedestrian tracking technologies and is widely applied in fields such as intelligent video surveillance and security across different cameras.
The invention targets the special scenario in which face tracking fails because surveillance cameras in public places cannot capture pedestrians' facial information. In real conditions, pedestrian appearance is easily affected by differences in illumination, occlusion, pose change, camera image quality and so on, so a model transferred directly from one cross-camera scene to the target scene performs poorly. The data set is therefore critical to applying this technology, which makes data set construction very important. Building a re-identification data set relies on manual annotation; labeling a large amount of data often takes several months, this labeling period greatly delays project deployment, and the labeling work consumes substantial human capital.
Disclosure of Invention
The invention aims to provide a cross-camera pedestrian re-identification method based on global feature matching to overcome the defects in the prior art.
In order to achieve the technical purpose, the technical scheme of the invention is realized as follows:
a cross-camera pedestrian re-identification method based on global feature matching comprises the following steps:
1) organizing the data into a data set that conforms to the format required by the training set of the target detection network; extracting image features with a constructed backbone network and performing detection training to improve accuracy in the re-identification detection preprocessing step; computing the loss of formula (1-1) from the feature-extracted image and the labeled detection boxes, and back-propagating the loss value through the chain rule to update the weight parameters, thereby training the target detection network and finally obtaining a model M1,
where S is the number of grid cells set in the detection network; B is the number of detection boxes; the object indicator is set to 1 if a target is detected and 0 otherwise; the no-object indicator is set to 1 if no target is detected and 0 if a target is detected; b_x and the corresponding marked value are the coordinates of the predicted detection box and of the actual labeled box, respectively; p_c is the probability under a given class; c_i is the confidence;
2) constructing a backbone network to extract global features of the images and applying GAP (global average pooling); after feature extraction, the pedestrian image is converted by GAP into a feature vector as shown in the following formula (1-2),
c = (a_11 + a_12 + ... + a_HW) / (H·W), computed for each of the C channels (1-2)
where a_11, a_12, ..., a_HW are the H·W values of one of the C channels; GAP reduces each channel to a single feature value c, and the C channel values finally form a matrix of dimension [C, 1],
the model calculates Loss through a formula (1-3), and then reversely transmits and updates the weight parameters through a chain rule:
Loss=max(d(a,p)-d(a,n)+margin,0) (1-3)
where a is the feature vector of the anchor pedestrian obtained from the backbone network through formula (1-2) during training, p is the feature of the same pedestrian (positive sample), n is the feature of a pedestrian other than this one (negative sample), d(·,·) is the feature distance, and margin is a preset value,
finally, after a certain number of iterations, the trained pedestrian feature extraction network yields a model M2, improving the robustness of the feature extraction network in the actual scene;
3) after the video file to be annotated is input, the user sets a detection frame rate T and a region of interest A; frames are extracted at the set frame rate and stored as pictures, and detection is applied only within the set region of interest A;
4) detecting pedestrians in the input video with the model M1 constructed in step 1) at the detection frame rate T set in step 3), restricting detection to the region of interest, and storing the detections from the region of interest under a specified naming convention to facilitate subsequent retrieval;
5) storing the video frames one by one at the frame rate T set in step 3), which speeds up reading during subsequent re-screening and inspection;
6) for the N intercepted pedestrian pictures, extracting features with the feature extraction network model M2 constructed in step 2); as shown in the following formula, each picture yields a C-dimensional vector, and the vectors are assembled into a [C, N] matrix:
f_1 = [a_1, a_2, ..., a_C] (1-4)
where a single pedestrian generates a C-dimensional vector f_1 as shown in formula (1-4); unsupervised clustering is then carried out according to formula (1-5), cyclically comparing the vectors by inner product to assign each pedestrian an independent feature number (ID);
d_n,n-1 = (f_n · f_n-1) / (||f_n|| · ||f_n-1||) (1-5)
where d_n,n-1 is the distance between the n-th and (n-1)-th feature vectors, and f_n and f_n-1 are the feature vectors of the n-th and (n-1)-th pedestrians respectively;
a feature distance matrix D is obtained from formula (1-5), with entries as follows:
D = [d_1,2, d_1,3, ..., d_n,n-1] (1-6)
where d_1,2 denotes the feature distance between the 1st and 2nd pedestrians, and so on;
the matrix is then clustered and partitioned according to a preset proximity threshold L, finally generating classified pedestrians and, after partitioning, the sample label collection used subsequently;
7) the pedestrian feature numbers extracted in step 6) and the camera numbers are numbered in frame order and sorted to obtain a label file, facilitating subsequent searching during cross-camera tracking;
8) in cross-camera tracking mode, the pedestrian P is selected and cropped from a video frame, and the model M2 constructed in step 2) extracts the features of P to obtain a feature vector f_p; an interval within a time difference S is set according to the frame number and camera name of the current crop, the capture time of each camera, and the spatial layout of the cameras; a pedestrian database Q to be queried is built from the images intercepted in step 5), and a feature matrix library f_Q is built over that interval; the vector inner-product distances are computed and sorted, and the intercepted pedestrian pictures are ordered by the nearest frames, yielding a data set of pedestrian P across different cameras and finally recovering the trajectory of the tracked pedestrian;
9) the manually inspected label data set is consolidated, and data augmentation and data amplification are applied where samples are scarce so as to achieve sample balance;
10) repeating steps 1) and 2) on the consolidated samples to improve the accuracy of models M1 and M2 and reduce the human capital needed for subsequent manual inspection.
The beneficial effects of the invention are as follows: the invention greatly shortens the working period of annotation staff and provides an effective technical means for promoting the implementation and application of pedestrian re-identification; the human capital consumed is greatly reduced, from manually tracking the target across cameras to merely reviewing and adjusting the results at a later stage, which is of significant practical value;
the invention can not only track and semi-automatically label the pedestrian target, but also can be applied to labeling tasks and tracking tasks of image re-identification of automobiles, commodities and the like. The invention takes a yo (yo only look once) as a target detection network as a detection part of preprocessing, and has the functions of segmenting a detection video frame, extracting a target, intercepting the target and the like. After the features of the intercepted images of the pedestrians are extracted, clustering is carried out in an unsupervised mode, so that data sets of different pedestrian classifications are obtained, convenience is brought to follow-up manual screening, and the landing scheme efficiency of a follow-up re-recognition technology is improved.
Drawings
FIG. 1 is an overall flow diagram;
FIG. 2 is a neural network training flow diagram;
FIG. 3 is a cross-camera tracking flow diagram;
FIG. 4 is a flow chart of sample integration extraction.
Detailed Description
The technical solution in the embodiments of the present invention is clearly and completely described below with reference to the drawings in the embodiments of the present invention.
As shown in fig. 1, a cross-camera pedestrian re-identification method based on global feature matching according to an embodiment of the present invention includes the following steps:
step 1: for example, according to a pedestrian target detection network Yolo, a pedestrian training set containing a multi-background environment is manufactured, and a specific number of data sets are constructed by taking COCO as a reference in the training set format. The number of the training sets and the number of the verification sets are distributed in a ratio of 3:1, and the verification sets do not carry samples in the training sets. After learning rate such as 0.00001 and maximum iteration times such as 80000 are set, inverse transfer operation is performed by using formula (1-1), so that a feature model M1 is obtained through training, and a training flow chart is shown in FIG. 2.
Step 2: and (2) according to the constructed pedestrian data set in the step (1), constructing certain pedestrian re-identification sample sets with the same pedestrian samples in different environments.
Following a naming rule such as [pid]_c[num]s[num]_[frame].jpg, an open-source data set such as Market-1501 is used for the integration. During integration, the original data set and the different cameras need to be consolidated and classified: cameras not present in the original data set are named by incrementing the original c[num], and [pid] is likewise incremented.
Three sample sets are generated: train, query and gallery. train serves as the training samples, containing at least 12 pedestrian identities; query serves as the test set and gallery as the test set's candidate sample set. During training only train is used; query, used for testing, does not overlap with the train set; the numbers of identities in train and query are in a 1:1 ratio, but only 3 to 6 samples are kept per query identity so that they can be searched in the subsequent gallery. Pedestrian image features are extracted with a backbone network such as ResNet-50; after GAP is applied to the final feature map, different pedestrian IDs are compared and back-propagation is performed with the loss of formula (1-3). When a set number of iterations, such as 400, is reached, the feature extraction model M2 is obtained.
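A minimal sketch of the Step 2 feature extractor and loss, assuming a PyTorch/torchvision environment: a ResNet-50 backbone followed by GAP produces a 2048-dimensional vector per image, and the triplet loss of formula (1-3) is back-propagated. The margin value and input resolution below are illustrative assumptions, not values fixed by the disclosure.

```python
import torch
import torch.nn as nn
from torchvision import models

class ReIDBackbone(nn.Module):
    """ResNet-50 backbone followed by GAP: one C-dimensional (2048) feature
    vector per pedestrian image, as in formula (1-2)."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50()                        # backbone network
        self.features = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc
        self.gap = nn.AdaptiveAvgPool2d(1)                # GAP over the H*W positions

    def forward(self, x):                                 # x: [B, 3, H, W]
        return self.gap(self.features(x)).flatten(1)      # [B, 2048]

model = ReIDBackbone()
triplet = nn.TripletMarginLoss(margin=0.3)                # Loss = max(d(a,p) - d(a,n) + margin, 0)
anchor, positive, negative = (torch.randn(8, 3, 256, 128) for _ in range(3))
loss = triplet(model(anchor), model(positive), model(negative))
loss.backward()                                            # chain-rule back-propagation of formula (1-3)
```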
Step 3: The extraction frame rate is set, for example T = 20, meaning that detection and extraction are performed once every 20 frames. Given the known image width and height, for example W = 1280 and H = 720, the detection region of the image is restricted to the set region of interest A.
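A sketch of the Step 3 frame extraction, assuming OpenCV is available; the region-of-interest coordinates and output naming are illustrative and follow the [framenum].jpg convention of Step 5.

```python
import cv2

def extract_frames(video_path, t=20, roi=(0, 0, 1280, 720)):
    """Keep one frame every t frames and crop it to the region of interest A
    given as (x, y, w, h); frames are saved as [framenum].jpg."""
    x, y, w, h = roi
    cap = cv2.VideoCapture(video_path)
    framenum = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if framenum % t == 0:
            cv2.imwrite(f"{framenum}.jpg", frame[y:y + h, x:x + w])
        framenum += 1
    cap.release()
```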
and 4, step 4: according to the set frame rate, every 20 frames pass, the video image is detected for the region of interest by using the model M1 to generate a detection frame, the detection frame is screened according to the preset conditions, then the frame is extracted, and the picture in the frame is intercepted and stored, for example, the command format adopts the following form: [ video ] _[ framework ] _[ classname ] _[ bbox ]. jpg to facilitate subsequent retrieval.
Step 5: At the set frame rate, the video frames themselves are saved, for example named [framenum].jpg, for subsequent query retrieval.
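The naming conventions of Steps 4 and 5 could be composed as in the sketch below; the hyphen-separated bounding-box encoding is an assumption, since the disclosure only fixes the [video]_[framenum]_[classname]_[bbox].jpg pattern.

```python
def crop_name(video, framenum, classname, bbox):
    """Compose the file name used in Step 4 for a detected pedestrian crop."""
    x1, y1, x2, y2 = bbox
    return f"{video}_{framenum}_{classname}_{x1}-{y1}-{x2}-{y2}.jpg"

def frame_name(framenum):
    """Compose the Step 5 file name for a whole saved video frame."""
    return f"{framenum}.jpg"

print(crop_name("cam01", 340, "person", (120, 60, 210, 400)))  # cam01_340_person_120-60-210-400.jpg
```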
Step 6: The trained feature extraction network model M2 is used, and a minimum separation threshold L is set, such as 0.5. Features are extracted from the pedestrian images cropped in Step 4; with a 2048-dimensional feature vector (C = 2048) per pedestrian and N set to 1000, the [C, N] matrix takes the form [2048, 1000], where N is the number of cropped pedestrian pictures. An inner-product operation over the matrix gives the similarity between pedestrian pictures; the vector f_1 of a single pedestrian picture is, for example, as follows.
f_1 = [0.1112232, 0.000067, 0.800022, ..., 0.2] (1-5)
The cosine distances between different persons are then obtained through vector inner-product operations, yielding values such as 0.12, 0.67 and 0.8:
D = [0.12, 0.67, 0.8, ..., 0.2] (1-6)
These are grouped into pedestrian classes according to the defined minimum threshold L, for example 0.5. The pedestrians are thus clustered without supervision and sorted into folders for the different IDs, such as 0 and 1.
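A sketch of the Step 6 grouping, assuming NumPy; the greedy nearest-center assignment is a simplification of the unsupervised clustering described above, kept only to show how the cosine similarity and the threshold L interact.

```python
import numpy as np

def cluster_by_cosine(features, threshold=0.5):
    """Assign an ID to each of the N pedestrian crops by thresholding the
    cosine similarity between their C-dimensional feature vectors."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    centers, labels = [], []
    for f in feats:
        sims = [float(f @ c) for c in centers]       # inner product = cosine similarity
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))      # join the closest existing ID
        else:
            labels.append(len(centers))              # open a new ID folder (0, 1, ...)
            centers.append(f)
    return labels

ids = cluster_by_cosine(np.random.rand(1000, 2048).astype(np.float32))
```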
Step 7: From the resulting archive folders and the previously cropped pictures, dictionary-form data are built for the folders generated in Step 6 and consolidated into a data dictionary to facilitate subsequent query and sorting. To make queries easier for the annotation staff, a new data structure is built on top of the frame number [framenum], the pedestrian numbers are re-consolidated, and the entries are re-sorted.
Step 8: In cross-camera tracking mode, the pedestrian to be tracked is cropped to obtain a picture P, and its features are extracted with model M2 to obtain a feature vector f_p. Because the cropped pictures follow the naming rule of Step 4, their temporal and spatial relations can be derived from the video and frame names, and pictures within the set time difference S of, for example, 2 minutes are taken as the videos of adjacent cameras. Based on this relation, the pedestrian data Q to be queried are passed through the same model M2 to obtain a multi-dimensional feature vector matrix f_Q; after the inner-product operation over the vector matrix, a sample set of the nearest temporal and spatial region is obtained, and the path of the tracked pedestrian can be returned according to the naming rule.
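A sketch of the Step 8 query, assuming NumPy and that the rows of the candidate matrix f_Q were produced by the same model M2; ranking by the normalized inner product stands in for the vector inner-product distance of the disclosure.

```python
import numpy as np

def rank_gallery(f_p, f_Q, names):
    """Rank the crops from neighbouring cameras (rows of f_Q) by cosine
    similarity to the query feature f_p; names holds their file names."""
    f_p = f_p / np.linalg.norm(f_p)
    f_Q = f_Q / np.linalg.norm(f_Q, axis=1, keepdims=True)
    order = np.argsort(-(f_Q @ f_p))                 # most similar first
    return [names[i] for i in order]
```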
Step 9: Using the manually inspected data set, pedestrians whose numbers were mislabeled are corrected and false detections are deleted; the pictures are consolidated frame by frame according to the merged labels and the consolidated samples are counted. Where the amount of data is too small or insufficient, specific data augmentation is applied, using post-processing such as cropping and random erasing, to obtain an augmented sample set after consolidation. The samples are then randomly shuffled and distributed into train, query and gallery sets at a 1:1 ratio.
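A sketch of the Step 9 augmentation pipeline, assuming torchvision; the image size, padding and erasing probability are illustrative values, and cropping plus random erasing are the two post-processing operations named in the text.

```python
import torchvision.transforms as T

# Augmentation for under-represented IDs: random cropping and random erasing.
augment = T.Compose([
    T.Resize((256, 128)),
    T.RandomCrop((256, 128), padding=10),
    T.ToTensor(),                 # RandomErasing operates on tensors
    T.RandomErasing(p=0.5),
])
# Usage: augmented = augment(pil_image) for each scarce-class sample.
```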
Step 10: Steps 1 and 2 are repeated to improve the robustness of the models in the specific scene, iteratively reducing the consumption of human capital.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (1)
1. A cross-camera pedestrian re-identification method based on global feature matching is characterized by comprising the following steps:
1) organizing a data set that conforms to the format required by the training set of the target detection network; extracting image features with a constructed backbone network and performing detection training of the model; computing the loss from the feature-extracted image and the labeled detection boxes and back-propagating the loss value through the chain rule to train the target detection network, finally obtaining a model M1, wherein the loss formula is as follows:
where S is the number of grid cells set in the detection network; B is the number of detection boxes; the object indicator is set to 1 if a target is detected and 0 otherwise; the no-object indicator is set to 1 if no target is detected and 0 if a target is detected; b_x and the corresponding marked value are the coordinates of the predicted detection box and of the actual labeled box, respectively; p_c is the probability under a given class; c_i is the confidence;
2) constructing a backbone network to extract the global features of the image and applying GAP global average pooling; after feature extraction, the feature vector of the pedestrian image after GAP is obtained as shown in the following formula (1-2),
c = (a_11 + a_12 + ... + a_HW) / (H·W), computed for each of the C channels (1-2)
where a_11, a_12, ..., a_HW are the H·W values of one of the C channels; GAP reduces each channel to a single feature value c, and the C channel values finally form a matrix of dimension [C, 1],
the model calculates Loss through a formula (1-3), and then reversely transmits and updates the weight parameters through a chain rule:
Loss=max(d(a,p)-d(a,n)+margin,0) (1-3)
where a is the feature vector of the pedestrian generated during training by the backbone network through formula (1-2), p is the feature of the same pedestrian, n is the feature of a pedestrian other than this one, and margin is a preset value; finally, after a certain number of iterations, the trained pedestrian feature extraction network gives a model M2;
3) after the video file to be annotated is input, the user sets a detection frame rate T and a region of interest A; frames are extracted at the set frame rate and stored as pictures, and detection is applied only within the set region of interest A;
4) detecting pedestrians in the input video with the model M1 constructed in step 1) at the detection frame rate T set in step 3), restricting detection to the region of interest and storing the detections from the region of interest under a specified naming convention;
5) storing the video frames frame by frame according to the frame rate T set in the step 3);
6) for the N intercepted pedestrian pictures, extracting features with the feature extraction network model M2 constructed in step 2); as shown in the following formula, each picture yields a C-dimensional vector, and the vectors are assembled into a [C, N] matrix:
f_1 = [a_1, a_2, ..., a_C] (1-4)
where a single pedestrian generates a C-dimensional vector f_1 as shown in formula (1-4) above; unsupervised clustering is then performed according to formula (1-5),
d_n,n-1 = (f_n · f_n-1) / (||f_n|| · ||f_n-1||) (1-5)
where d_n,n-1 is the distance between the n-th and (n-1)-th feature vectors, and f_n and f_n-1 are the feature vectors of the n-th and (n-1)-th pedestrians respectively,
a feature distance matrix D is obtained from formula (1-5), with entries as follows:
D = [d_1,2, d_1,3, ..., d_n,n-1] (1-6)
where d_1,2 denotes the feature distance between the 1st and 2nd pedestrians, and so on;
the matrix is then clustered and partitioned according to a preset proximity threshold L, finally generating classified pedestrians and, after partitioning, constructing the sample label set used subsequently;
7) sorting by the pedestrian feature numbers extracted in step 6), the camera numbers and the frame numbers to obtain the label files;
8) in cross-camera tracking mode, the pedestrian P is selected and cropped from a video frame, and the model M2 constructed in step 2) extracts the features of P to obtain a feature vector f_p; an interval within a time difference S is set according to the frame number and camera name of the current crop, the capture time of each camera, and the spatial layout of the cameras; a pedestrian database Q to be queried is built from the images intercepted in step 5), and a feature matrix library f_Q is built over that interval; the vector inner-product distances are computed and sorted, and the intercepted pedestrian pictures are ordered by the nearest frames, yielding a data set of pedestrian P across different cameras and finally recovering the trajectory of the tracked pedestrian;
9) the manually inspected label data set is consolidated, and data augmentation and data amplification are applied where samples are insufficient so as to achieve sample balance;
10) repeating steps 1) and 2) on the consolidated samples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110423474.7A CN113158891B (en) | 2021-04-20 | 2021-04-20 | Cross-camera pedestrian re-identification method based on global feature matching |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110423474.7A CN113158891B (en) | 2021-04-20 | 2021-04-20 | Cross-camera pedestrian re-identification method based on global feature matching |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113158891A CN113158891A (en) | 2021-07-23 |
CN113158891B true CN113158891B (en) | 2022-08-19 |
Family
ID=76869028
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110423474.7A Active CN113158891B (en) | 2021-04-20 | 2021-04-20 | Cross-camera pedestrian re-identification method based on global feature matching |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113158891B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115050048B (en) * | 2022-05-25 | 2023-04-18 | 杭州像素元科技有限公司 | Cross-modal pedestrian re-identification method based on local detail features |
CN117523345B (en) * | 2024-01-08 | 2024-04-23 | 武汉理工大学 | Target detection data balancing method and device |
CN117710903B (en) * | 2024-02-05 | 2024-05-03 | 南京信息工程大学 | Visual specific pedestrian tracking method and system based on ReID and Yolov5 double models |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108875588B (en) * | 2018-05-25 | 2022-04-15 | 武汉大学 | Cross-camera pedestrian detection tracking method based on deep learning |
EP3912132A4 (en) * | 2019-02-28 | 2022-12-07 | Stats Llc | System and method for generating player tracking data from broadcast video |
CN111274992A (en) * | 2020-02-12 | 2020-06-12 | 北方工业大学 | Cross-camera pedestrian re-identification method and system |
CN111709336B (en) * | 2020-06-08 | 2024-04-26 | 杭州像素元科技有限公司 | Expressway pedestrian detection method, equipment and readable storage medium |
2021-04-20: CN202110423474.7A patent/CN113158891B/en (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN113158891A (en) | 2021-07-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |