CN117710903B - Visual specific pedestrian tracking method and system based on ReID and Yolov5 double models - Google Patents
Abstract
The invention discloses a visual specific pedestrian tracking method and system based on ReID and Yolov5 double models, belonging to the technical field of image recognition. The method comprises the following steps: (1) acquiring pedestrian data, preprocessing the data and storing them; (2) extracting the preprocessed data frame by frame, extracting the person pictures in each frame with YOLOv5, and saving the person pictures; (3) acquiring the Market-1501 dataset, randomly dividing it into a test set and a training set, and training a ResNet model to obtain the final model; (4) feeding the pictures obtained in step (2) into the trained model through a transfer learning method and extracting their features; (5) combining the extracted pictures into a tensor, passing it to the ReID model, and performing feature extraction and normalization; (6) comparing the minimum average distance with the distance threshold; (7) saving the final video. By means of YOLOv5 target detection of a specific class and ReID search of a specific ID, the invention realizes the function of collaboratively searching for and tracking offenders.
Description
Technical Field
The invention relates to the technical field of image recognition, in particular to a visual specific pedestrian tracking method and system based on ReID and Yolov5 double models.
Background
In recent years, with the rapid growth in the number of non-motor vehicles on campus and the increasing openness of universities, more and more non-motor vehicles enter campuses and park illegally. Campus traffic safety problems have become increasingly prominent and are of deep concern to school administrators, teachers and students. Solving the campus traffic safety problem is therefore urgent.
Disclosure of Invention
The invention aims to: the invention aims to provide a visual specific pedestrian tracking method and system based on ReID and Yolov5 double models, which, by transferring a model self-trained on the Market-1501 dataset with the strong baseline ReID together with a model initialized from the yolov5s.pt weight file, realize photo-to-video pedestrian re-identification under the double models and search for specific IDs frame by frame, so as to solve campus security problems.
The technical scheme is as follows: the invention discloses a visual specific pedestrian tracking method based on ReID and Yolov5 double models, which comprises the following steps:
(1) Acquiring pedestrian data by using a plurality of 1280×1080 high-definition cameras and SD cameras, preprocessing the data and storing them;
(2) Extracting the preprocessed data frame by frame, extracting the person pictures in each frame with YOLOv5, and saving the pictures together with the person picture coordinates;
(3) Acquiring the Market-1501 dataset, randomly dividing it into a test set and a training set, and training a ResNet model to obtain the final model;
(4) Feeding the pictures obtained in step (2) into the trained model through a transfer learning method and extracting their features;
(5) Combining the extracted pictures into a tensor, passing it to the trained ReID model, performing feature extraction and normalization, and finding the index of the minimum average distance by calculating the mean of the Euclidean distances between the query picture and each stored picture;
(6) Setting a corresponding distance threshold for each scene and comparing the minimum average distance with the distance threshold; when the minimum average distance is smaller than the distance threshold, outputting the final target bounding box, including the position coordinates, confidence and category information of the bounding box;
(7) Saving the final video.
Further, in the step (1), the preprocessing is specifically as follows: labeling the data, including manual labeling and labeling with a DPM detector; fixing the image size to 256×128; randomly flipping each image horizontally with probability 0.5, decoding each image into 32-bit floating-point raw pixel values, and normalizing the RGB channels.
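A minimal sketch of this preprocessing, assuming an image already resized to 256×128 and the ImageNet channel statistics commonly used in ReID pipelines (the patent does not state which normalization values it uses):

```python
import numpy as np

# Channel statistics are an assumption (ImageNet defaults); the patent
# does not specify the mean/std used for normalization.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img_uint8, rng, flip_prob=0.5):
    """img_uint8: HxWx3 uint8 RGB picture already resized to 256x128.
    Decodes to 32-bit floats, flips horizontally with probability 0.5,
    and normalizes the RGB channels. Returns a CHW float32 array."""
    img = img_uint8.astype(np.float32) / 255.0   # 32-bit raw pixel values in [0, 1]
    if rng.random() < flip_prob:                 # random horizontal flip
        img = img[:, ::-1, :]
    img = (img - MEAN) / STD                     # per-channel normalization
    return np.ascontiguousarray(img.transpose(2, 0, 1))
```

In an actual pipeline the same transform would be applied at both training and inference time, except that the random flip is disabled for inference.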
Further, in the step (1), the preprocessed data is stored with the following naming criterion: 000N_cNsN_000ABC_0N.jpeg or 000N_cNsN_000ABC_00.jpeg, wherein 000N is the label number of each person; cN denotes the Nth camera; sN denotes the Nth video clip; 000ABC denotes frame 000ABC of clip cNsN; 0N denotes the Nth detection box in frame cNsN_000ABC.
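The naming rule can be checked mechanically; the following small parser is a sketch (the example filename is hypothetical, chosen only to illustrate the fields):

```python
import re

# Matches names such as 0001_c1s2_000451_03.jpeg under the rule above.
NAME_RE = re.compile(r"^(\d{4})_c(\d+)s(\d+)_(\d{6})_(\d{2})\.jpe?g$")

def parse_name(fname):
    """Split a stored picture name into its person / camera / clip /
    frame / detection-box fields."""
    m = NAME_RE.match(fname)
    if m is None:
        raise ValueError(f"name does not follow the convention: {fname}")
    person, cam, clip, frame, box = (int(g) for g in m.groups())
    return {"person": person, "camera": cam, "clip": clip,
            "frame": frame, "box": box}
```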
Further, the step (3) specifically comprises the following steps: modifying the dimension of the fully-connected layer of the ResNet model to N; optimizing the model with the Adam method; and selecting the identity loss, the center loss and the enhanced triplet loss as the total loss in back propagation; the formula is as follows:

$$L = L_{id} + L_{tri} + L_{c}$$

wherein $L_{id}$ is the identity loss, $L_{tri}$ is the enhanced triplet loss, and $L_{c}$ is the center loss.

Further, the identity loss formula is as follows:

$$L_{id} = -\frac{1}{n}\sum_{i=1}^{n}\log p(y_i \mid x_i)$$

wherein $n$ is the number of training samples in each batch; given a label $y_i$ and an input image $x_i$, $p(y_i \mid x_i)$ is the predicted probability that $x_i$ is identified as category $y_i$.

The enhanced triplet loss is as follows:

$$L_{tri} = \log\left(1 + \exp\Big(\sum_{j \in P_i} w_{ij}^{p} d_{ij}^{p} - \sum_{k \in N_i} w_{ik}^{n} d_{ik}^{n}\Big)\right)$$

wherein $(i, j, k)$ denotes the triplet formed for each anchor sample $x_i$ in each training batch; $d_{ij}^{p}$ is the distance between the positive sample pair and $d_{ik}^{n}$ is the distance between the negative sample pair; for anchor sample $x_i$, $x_j$ is the corresponding positive sample; $w_{ij}^{p}$ is the weight of the Euclidean distance between anchor $x_i$ and its positive sample $x_j$, and $w_{ik}^{n}$ is the corresponding negative-pair weight; $P_i$ denotes the set of positive pairs and $N_i$ the set of negative pairs of the anchor.

The Euclidean distance between two samples is:

$$d_{ij} = \lVert f_i - f_j \rVert_2$$

wherein $f_i$ and $f_j$ are the feature vectors corresponding to $x_i$ and $x_j$.

The weights are defined as:

$$w_{ij}^{p} = \frac{\exp(d_{ij}^{p})}{\sum_{d^{p} \in P_i} \exp(d^{p})}, \qquad w_{ik}^{n} = \frac{\exp(-d_{ik}^{n})}{\sum_{d^{n} \in N_i} \exp(-d^{n})}$$

The center loss, half the sum of squared errors between each feature and its class center, is as follows:

$$L_{c} = \frac{1}{2}\sum_{i=1}^{n} \lVert f_i - c_{y_i} \rVert_2^{2}$$

wherein $c_{y_i}$ is the class center of identity $y_i$.
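The three loss terms can be sketched numerically. The following NumPy illustration of the formulas above (single anchor for the triplet term) is an assumption-laden sketch, not the patent's actual training code:

```python
import numpy as np

def identity_loss(probs):
    """L_id: mean negative log-probability of the true label.
    probs: p(y_i | x_i) for each of the n samples in the batch."""
    return -np.mean(np.log(probs))

def enhanced_triplet_loss(d_pos, d_neg):
    """Soft-margin triplet loss for one anchor, with softmax-weighted
    positive distances d_pos and negative distances d_neg."""
    w_p = np.exp(d_pos) / np.exp(d_pos).sum()    # farther positives weigh more
    w_n = np.exp(-d_neg) / np.exp(-d_neg).sum()  # closer negatives weigh more
    return np.log1p(np.exp((w_p * d_pos).sum() - (w_n * d_neg).sum()))

def center_loss(feats, labels, centers):
    """L_c: half the summed squared distance of each feature to its class center."""
    diff = feats - centers[labels]
    return 0.5 * np.sum(diff ** 2)
```

The total loss is then the sum of the three terms, back-propagated with Adam as described above.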
The invention also discloses a visual specific pedestrian tracking system based on ReID and Yolov5 double models, which comprises:
A preprocessing module: for acquiring pedestrian data using a plurality of 1280×1080 high-definition cameras and SD cameras, preprocessing the data and storing them;
A picture extraction module: for extracting the preprocessed data frame by frame, extracting the person pictures in each frame with YOLOv5, and saving the pictures together with the person picture coordinates;
A model training module: for acquiring the Market-1501 dataset, randomly dividing it into a test set and a training set, and training a ResNet model to obtain the final model;
A feature extraction module: for feeding the pictures obtained by the picture extraction module into the trained model through a transfer learning method and extracting their features;
An index module: for combining the extracted pictures into a tensor, passing it to the trained ReID model, performing feature extraction and normalization, and finding the index of the minimum average distance by calculating the mean of the Euclidean distances between the query picture and each stored picture;
An output module: for setting a corresponding distance threshold for each scene and comparing the minimum average distance with the distance threshold; when the minimum average distance is smaller than the distance threshold, outputting the final target bounding box, including the position coordinates, confidence and category information of the bounding box;
A storage module: for saving the final video.
Further, in the preprocessing module, the preprocessing is specifically as follows: labeling the data, including manual labeling and labeling with a DPM detector; fixing the image size to 256×128; randomly flipping each image horizontally with probability 0.5, decoding each image into 32-bit floating-point raw pixel values, and normalizing the RGB channels.
Further, in the preprocessing module, the preprocessed data is stored with the following naming criterion: 000N_cNsN_000ABC_0N.jpeg or 000N_cNsN_000ABC_00.jpeg, wherein 000N is the label number of each person; cN denotes the Nth camera; sN denotes the Nth video clip; 000ABC denotes frame 000ABC of clip cNsN; 0N denotes the Nth detection box in frame cNsN_000ABC.
Further, in the model training module, the specific steps are as follows: modifying the dimension of the fully-connected layer of the ResNet model to N; optimizing the model with the Adam method; and selecting the identity loss, the center loss and the enhanced triplet loss as the total loss in back propagation; the formula is as follows:

$$L = L_{id} + L_{tri} + L_{c}$$

wherein $L_{id}$ is the identity loss, $L_{tri}$ is the enhanced triplet loss, and $L_{c}$ is the center loss.

Further, in the model training module, the identity loss formula is as follows:

$$L_{id} = -\frac{1}{n}\sum_{i=1}^{n}\log p(y_i \mid x_i)$$

wherein $n$ is the number of training samples in each batch; given a label $y_i$ and an input image $x_i$, $p(y_i \mid x_i)$ is the predicted probability that $x_i$ is identified as category $y_i$.

The enhanced triplet loss is as follows:

$$L_{tri} = \log\left(1 + \exp\Big(\sum_{j \in P_i} w_{ij}^{p} d_{ij}^{p} - \sum_{k \in N_i} w_{ik}^{n} d_{ik}^{n}\Big)\right)$$

wherein $(i, j, k)$ denotes the triplet formed for each anchor sample $x_i$ in each training batch; $d_{ij}^{p}$ is the distance between the positive sample pair and $d_{ik}^{n}$ is the distance between the negative sample pair; for anchor sample $x_i$, $x_j$ is the corresponding positive sample; $w_{ij}^{p}$ is the weight of the Euclidean distance between anchor $x_i$ and its positive sample $x_j$, and $w_{ik}^{n}$ is the corresponding negative-pair weight; $P_i$ denotes the set of positive pairs and $N_i$ the set of negative pairs of the anchor.

The Euclidean distance between two samples is:

$$d_{ij} = \lVert f_i - f_j \rVert_2$$

wherein $f_i$ and $f_j$ are the feature vectors corresponding to $x_i$ and $x_j$.

The weights are defined as:

$$w_{ij}^{p} = \frac{\exp(d_{ij}^{p})}{\sum_{d^{p} \in P_i} \exp(d^{p})}, \qquad w_{ik}^{n} = \frac{\exp(-d_{ik}^{n})}{\sum_{d^{n} \in N_i} \exp(-d^{n})}$$

The center loss, half the sum of squared errors between each feature and its class center, is as follows:

$$L_{c} = \frac{1}{2}\sum_{i=1}^{n} \lVert f_i - c_{y_i} \rVert_2^{2}$$

wherein $c_{y_i}$ is the class center of identity $y_i$.
The beneficial effects are that: compared with the prior art, the invention has the following remarkable advantage: by using YOLOv5 for target detection of a specific class (person) and ReID for search of a specific ID, the function of collaboratively searching for and tracking offenders under multiple cameras is realized.
Drawings
Fig. 1 is a schematic diagram of the principle of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, an embodiment of the present invention provides a visual specific pedestrian tracking method based on ReID and Yolov5 double models, which comprises the following steps:
(1) Acquiring pedestrian data by using a plurality of 1280×1080 high-definition cameras and SD cameras, preprocessing the data and storing them. The preprocessing is specifically as follows: labeling the data, including manual labeling and labeling with a DPM detector; fixing the image size to 256×128; randomly flipping each image horizontally with probability 0.5, decoding each image into 32-bit floating-point raw pixel values, and normalizing the RGB channels. The preprocessed data is stored with the following naming criterion: 000N_cNsN_000ABC_0N.jpeg or 000N_cNsN_000ABC_00.jpeg, wherein 000N is the label number of each person; cN denotes the Nth camera; sN denotes the Nth video clip; 000ABC denotes frame 000ABC of clip cNsN; 0N denotes the Nth detection box in frame cNsN_000ABC.
(2) Extracting the preprocessed data frame by frame, extracting the person pictures in each frame with YOLOv5, and saving the pictures together with the person picture coordinates;
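A sketch of step (2)'s crop logic. The detection rows follow YOLOv5's (x1, y1, x2, y2, confidence, class) output convention with class 0 = person; the confidence threshold here is an assumption, and loading the detector itself is shown only as a comment:

```python
import numpy as np

PERSON = 0  # COCO class index used by YOLOv5 for "person"

def crop_persons(frame, detections, conf_thres=0.5):
    """frame: HxWx3 array for one video frame.
    detections: iterable of (x1, y1, x2, y2, conf, cls) rows.
    Returns the person crops of the frame together with their coordinates."""
    crops = []
    for x1, y1, x2, y2, conf, cls in detections:
        if int(cls) == PERSON and conf >= conf_thres:
            x1, y1, x2, y2 = map(int, (x1, y1, x2, y2))
            crops.append(((x1, y1, x2, y2), frame[y1:y2, x1:x2]))
    return crops

# Obtaining `detections` (assumed usage of the public yolov5 hub entry point):
#   model = torch.hub.load("ultralytics/yolov5", "yolov5s")
#   detections = model(frame).xyxy[0].tolist()
```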
(3) Acquiring the Market-1501 dataset, randomly dividing it into a test set and a training set, and training a ResNet model to obtain the final model, specifically: modifying the dimension of the fully-connected layer of the ResNet model to N; optimizing the model with the Adam method; and selecting the identity loss, the center loss and the enhanced triplet loss as the total loss in back propagation; the formula is as follows:

$$L = L_{id} + L_{tri} + L_{c}$$

wherein $L_{id}$ is the identity loss, $L_{tri}$ is the enhanced triplet loss, and $L_{c}$ is the center loss.

The identity loss formula is as follows:

$$L_{id} = -\frac{1}{n}\sum_{i=1}^{n}\log p(y_i \mid x_i)$$

wherein $n$ is the number of training samples in each batch; given a label $y_i$ and an input image $x_i$, $p(y_i \mid x_i)$ is the predicted probability that $x_i$ is identified as category $y_i$.

The enhanced triplet loss is as follows:

$$L_{tri} = \log\left(1 + \exp\Big(\sum_{j \in P_i} w_{ij}^{p} d_{ij}^{p} - \sum_{k \in N_i} w_{ik}^{n} d_{ik}^{n}\Big)\right)$$

wherein $(i, j, k)$ denotes the triplet formed for each anchor sample $x_i$ in each training batch; $d_{ij}^{p}$ is the distance between the positive sample pair and $d_{ik}^{n}$ is the distance between the negative sample pair; for anchor sample $x_i$, $x_j$ is the corresponding positive sample; $w_{ij}^{p}$ is the weight of the Euclidean distance between anchor $x_i$ and its positive sample $x_j$, and $w_{ik}^{n}$ is the corresponding negative-pair weight; $P_i$ denotes the set of positive pairs and $N_i$ the set of negative pairs of the anchor.

The Euclidean distance between two samples is:

$$d_{ij} = \lVert f_i - f_j \rVert_2$$

wherein $f_i$ and $f_j$ are the feature vectors corresponding to $x_i$ and $x_j$.

The weights are defined as:

$$w_{ij}^{p} = \frac{\exp(d_{ij}^{p})}{\sum_{d^{p} \in P_i} \exp(d^{p})}, \qquad w_{ik}^{n} = \frac{\exp(-d_{ik}^{n})}{\sum_{d^{n} \in N_i} \exp(-d^{n})}$$

The center loss, half the sum of squared errors between each feature and its class center, is as follows:

$$L_{c} = \frac{1}{2}\sum_{i=1}^{n} \lVert f_i - c_{y_i} \rVert_2^{2}$$

wherein $c_{y_i}$ is the class center of identity $y_i$.
(4) Feeding the pictures obtained in step (2) into the trained model through a transfer learning method and extracting their features;
(5) Combining the extracted pictures into a tensor, passing it to the trained ReID model, performing feature extraction and normalization, and finding the index of the minimum average distance by calculating the mean of the Euclidean distances between the query picture and each stored picture.
(6) Setting a corresponding distance threshold for each scene and comparing the minimum average distance with the distance threshold; when the minimum average distance is smaller than the distance threshold, outputting the final target bounding box, including the position coordinates, confidence and category information of the bounding box;
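Steps (5) and (6) reduce to a nearest-match search plus a threshold gate. A minimal sketch over plain feature arrays (the threshold value is scene-dependent; the one in the test below is only an example):

```python
import numpy as np

def match_query(query_feats, stored_feats, threshold):
    """query_feats: QxD features of the query picture(s);
    stored_feats: SxD features of the stored person pictures.
    Returns (index, distance) of the stored picture whose mean Euclidean
    distance to the query features is smallest, or (None, distance)
    when that minimum average distance is not below the scene's threshold."""
    diffs = stored_feats[:, None, :] - query_feats[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # S x Q Euclidean distances
    avg = dists.mean(axis=1)                     # mean distance per stored picture
    idx = int(avg.argmin())
    best = float(avg[idx])
    return (idx, best) if best < threshold else (None, best)
```

When a match is returned, the corresponding detection's bounding box (position coordinates, confidence and class) is emitted as the final target.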
(7) Saving the final video.
Comparing various state-of-the-art methods with the pedestrian re-identification method of the invention, the invention performs excellently in training on the two datasets, as shown in Table 1.
TABLE 1 training comparison of this model with other leading edge models
The embodiment of the invention also provides a visual specific pedestrian tracking system based on ReID and Yolov5 double models, which comprises:
A preprocessing module: for acquiring pedestrian data using a plurality of 1280×1080 high-definition cameras and SD cameras, preprocessing the data and storing them. The preprocessing is specifically as follows: labeling the data, including manual labeling and labeling with a DPM detector; fixing the image size to 256×128; randomly flipping each image horizontally with probability 0.5, decoding each image into 32-bit floating-point raw pixel values, and normalizing the RGB channels. The preprocessed data is stored with the following naming criterion: 000N_cNsN_000ABC_0N.jpeg or 000N_cNsN_000ABC_00.jpeg, wherein 000N is the label number of each person; cN denotes the Nth camera; sN denotes the Nth video clip; 000ABC denotes frame 000ABC of clip cNsN; 0N denotes the Nth detection box in frame cNsN_000ABC.
A picture extraction module: for extracting the preprocessed data frame by frame, extracting the person pictures in each frame with YOLOv5, and saving the pictures together with the person picture coordinates;
A model training module: for acquiring the Market-1501 dataset, randomly dividing it into a test set and a training set, and training a ResNet model to obtain the final model, specifically: modifying the dimension of the fully-connected layer of the ResNet model to N; optimizing the model with the Adam method; and selecting the identity loss, the center loss and the enhanced triplet loss as the total loss in back propagation; the formula is as follows:

$$L = L_{id} + L_{tri} + L_{c}$$

wherein $L_{id}$ is the identity loss, $L_{tri}$ is the enhanced triplet loss, and $L_{c}$ is the center loss.

The identity loss formula is as follows:

$$L_{id} = -\frac{1}{n}\sum_{i=1}^{n}\log p(y_i \mid x_i)$$

wherein $n$ is the number of training samples in each batch; given a label $y_i$ and an input image $x_i$, $p(y_i \mid x_i)$ is the predicted probability that $x_i$ is identified as category $y_i$.

The enhanced triplet loss is as follows:

$$L_{tri} = \log\left(1 + \exp\Big(\sum_{j \in P_i} w_{ij}^{p} d_{ij}^{p} - \sum_{k \in N_i} w_{ik}^{n} d_{ik}^{n}\Big)\right)$$

wherein $(i, j, k)$ denotes the triplet formed for each anchor sample $x_i$ in each training batch; $d_{ij}^{p}$ is the distance between the positive sample pair and $d_{ik}^{n}$ is the distance between the negative sample pair; for anchor sample $x_i$, $x_j$ is the corresponding positive sample; $w_{ij}^{p}$ is the weight of the Euclidean distance between anchor $x_i$ and its positive sample $x_j$, and $w_{ik}^{n}$ is the corresponding negative-pair weight; $P_i$ denotes the set of positive pairs and $N_i$ the set of negative pairs of the anchor.

The Euclidean distance between two samples is:

$$d_{ij} = \lVert f_i - f_j \rVert_2$$

wherein $f_i$ and $f_j$ are the feature vectors corresponding to $x_i$ and $x_j$.

The weights are defined as:

$$w_{ij}^{p} = \frac{\exp(d_{ij}^{p})}{\sum_{d^{p} \in P_i} \exp(d^{p})}, \qquad w_{ik}^{n} = \frac{\exp(-d_{ik}^{n})}{\sum_{d^{n} \in N_i} \exp(-d^{n})}$$

The center loss, half the sum of squared errors between each feature and its class center, is as follows:

$$L_{c} = \frac{1}{2}\sum_{i=1}^{n} \lVert f_i - c_{y_i} \rVert_2^{2}$$

wherein $c_{y_i}$ is the class center of identity $y_i$.
A feature extraction module: for feeding the pictures obtained by the picture extraction module into the trained model through a transfer learning method and extracting their features;
An index module: for combining the extracted pictures into a tensor, passing it to the trained ReID model, performing feature extraction and normalization, and finding the index of the minimum average distance by calculating the mean of the Euclidean distances between the query picture and each stored picture.
An output module: for setting a corresponding distance threshold for each scene and comparing the minimum average distance with the distance threshold; when the minimum average distance is smaller than the distance threshold, outputting the final target bounding box, including the position coordinates, confidence and category information of the bounding box;
A storage module: for saving the final video.
Claims (6)
1. A visual specific pedestrian tracking method based on ReID and Yolov5 double models, comprising the following steps:
(1) Acquiring pedestrian data by using a plurality of 1280×1080 high-definition cameras and SD cameras, preprocessing the data and storing them;
(2) Extracting the preprocessed data frame by frame, extracting the person pictures in each frame with YOLOv5, and saving the pictures together with the person picture coordinates;
(3) Acquiring the Market-1501 dataset, randomly dividing it into a test set and a training set, and training a ResNet model to obtain the final model, specifically: modifying the dimension of the fully-connected layer of the ResNet model to N; optimizing the model with the Adam method; and selecting the identity loss, the center loss and the enhanced triplet loss as the total loss in back propagation; the formula is as follows:

$$L = L_{id} + L_{tri} + L_{c}$$

wherein $L_{id}$ is the identity loss, $L_{tri}$ is the enhanced triplet loss, and $L_{c}$ is the center loss; the identity loss formula is as follows:

$$L_{id} = -\frac{1}{n}\sum_{i=1}^{n}\log p(y_i \mid x_i)$$

wherein $n$ is the number of training samples in each batch; given a label $y_i$ and an input image $x_i$, $p(y_i \mid x_i)$ is the predicted probability that $x_i$ is identified as category $y_i$;

the enhanced triplet loss is as follows:

$$L_{tri} = \log\left(1 + \exp\Big(\sum_{j \in P_i} w_{ij}^{p} d_{ij}^{p} - \sum_{k \in N_i} w_{ik}^{n} d_{ik}^{n}\Big)\right)$$

wherein $(i, j, k)$ denotes the triplet formed for each anchor sample $x_i$ in each training batch; $d_{ij}^{p}$ is the distance between the positive sample pair and $d_{ik}^{n}$ is the distance between the negative sample pair; for anchor sample $x_i$, $x_j$ is the corresponding positive sample; $w_{ij}^{p}$ is the weight of the Euclidean distance between anchor $x_i$ and its positive sample $x_j$, and $w_{ik}^{n}$ is the corresponding negative-pair weight; $P_i$ denotes the set of positive pairs and $N_i$ the set of negative pairs of the anchor;

the Euclidean distance between two samples is:

$$d_{ij} = \lVert f_i - f_j \rVert_2$$

wherein $f_i$ and $f_j$ are the feature vectors corresponding to $x_i$ and $x_j$;

the weights are defined as:

$$w_{ij}^{p} = \frac{\exp(d_{ij}^{p})}{\sum_{d^{p} \in P_i} \exp(d^{p})}, \qquad w_{ik}^{n} = \frac{\exp(-d_{ik}^{n})}{\sum_{d^{n} \in N_i} \exp(-d^{n})}$$

the center loss, half the sum of squared errors between each feature and its class center, is as follows:

$$L_{c} = \frac{1}{2}\sum_{i=1}^{n} \lVert f_i - c_{y_i} \rVert_2^{2}$$

wherein $c_{y_i}$ is the class center of identity $y_i$;
(4) Feeding the pictures obtained in step (2) into the trained model through a transfer learning method and extracting their features;
(5) Combining the extracted pictures into a tensor, passing it to the trained ReID model, performing feature extraction and normalization, and finding the index of the minimum average distance by calculating the mean of the Euclidean distances between the query picture and each stored picture;
(6) Setting a corresponding distance threshold for each scene and comparing the minimum average distance with the distance threshold; when the minimum average distance is smaller than the distance threshold, outputting the final target bounding box, including the position coordinates, confidence and category information of the bounding box;
(7) Saving the final video.
2. The visual specific pedestrian tracking method based on ReID and Yolov5 double models according to claim 1, wherein in the step (1) the preprocessing is specifically as follows: labeling the data, including manual labeling and labeling with a DPM detector; fixing the image size to 256×128; randomly flipping each image horizontally with probability 0.5, decoding each image into 32-bit floating-point raw pixel values, and normalizing the RGB channels.
3. The visual specific pedestrian tracking method based on ReID and Yolov5 double models according to claim 1, wherein in the step (1) the preprocessed data is stored with the following naming criterion: 000N_cNsN_000ABC_0N.jpeg or 000N_cNsN_000ABC_00.jpeg, wherein 000N is the label number of each person; cN denotes the Nth camera; sN denotes the Nth video clip; 000ABC denotes frame 000ABC of clip cNsN; 0N denotes the Nth detection box in frame cNsN_000ABC.
4. A visual specific pedestrian tracking system based on ReID and Yolov5 double models, comprising:
A preprocessing module: for acquiring pedestrian data using a plurality of 1280×1080 high-definition cameras and SD cameras, preprocessing the data and storing them;
A picture extraction module: for extracting the preprocessed data frame by frame, extracting the person pictures in each frame with YOLOv5, and saving the pictures together with the person picture coordinates;
A model training module: for acquiring the Market-1501 dataset, randomly dividing it into a test set and a training set, and training a ResNet model to obtain the final model, specifically: modifying the dimension of the fully-connected layer of the ResNet model to N; optimizing the model with the Adam method; and selecting the identity loss, the center loss and the enhanced triplet loss as the total loss in back propagation; the formula is as follows:

$$L = L_{id} + L_{tri} + L_{c}$$

wherein $L_{id}$ is the identity loss, $L_{tri}$ is the enhanced triplet loss, and $L_{c}$ is the center loss; the identity loss formula is as follows:

$$L_{id} = -\frac{1}{n}\sum_{i=1}^{n}\log p(y_i \mid x_i)$$

wherein $n$ is the number of training samples in each batch; given a label $y_i$ and an input image $x_i$, $p(y_i \mid x_i)$ is the predicted probability that $x_i$ is identified as category $y_i$;

the enhanced triplet loss is as follows:

$$L_{tri} = \log\left(1 + \exp\Big(\sum_{j \in P_i} w_{ij}^{p} d_{ij}^{p} - \sum_{k \in N_i} w_{ik}^{n} d_{ik}^{n}\Big)\right)$$

wherein $(i, j, k)$ denotes the triplet formed for each anchor sample $x_i$ in each training batch; $d_{ij}^{p}$ is the distance between the positive sample pair and $d_{ik}^{n}$ is the distance between the negative sample pair; for anchor sample $x_i$, $x_j$ is the corresponding positive sample; $w_{ij}^{p}$ is the weight of the Euclidean distance between anchor $x_i$ and its positive sample $x_j$, and $w_{ik}^{n}$ is the corresponding negative-pair weight; $P_i$ denotes the set of positive pairs and $N_i$ the set of negative pairs of the anchor;

the Euclidean distance between two samples is:

$$d_{ij} = \lVert f_i - f_j \rVert_2$$

wherein $f_i$ and $f_j$ are the feature vectors corresponding to $x_i$ and $x_j$;

the weights are defined as:

$$w_{ij}^{p} = \frac{\exp(d_{ij}^{p})}{\sum_{d^{p} \in P_i} \exp(d^{p})}, \qquad w_{ik}^{n} = \frac{\exp(-d_{ik}^{n})}{\sum_{d^{n} \in N_i} \exp(-d^{n})}$$

the center loss, half the sum of squared errors between each feature and its class center, is as follows:

$$L_{c} = \frac{1}{2}\sum_{i=1}^{n} \lVert f_i - c_{y_i} \rVert_2^{2}$$

wherein $c_{y_i}$ is the class center of identity $y_i$;
A feature extraction module: for feeding the pictures obtained by the picture extraction module into the trained model through a transfer learning method and extracting their features;
An index module: for combining the extracted pictures into a tensor, passing it to the trained ReID model, performing feature extraction and normalization, and finding the index of the minimum average distance by calculating the mean of the Euclidean distances between the query picture and each stored picture;
An output module: for setting a corresponding distance threshold for each scene and comparing the minimum average distance with the distance threshold; when the minimum average distance is smaller than the distance threshold, outputting the final target bounding box, including the position coordinates, confidence and category information of the bounding box;
A storage module: for saving the final video.
5. The visual specific pedestrian tracking system based on ReID and Yolov5 double models according to claim 4, wherein the preprocessing module performs preprocessing as follows: labeling the data, including manual labeling and labeling with a DPM detector; fixing the image size to 256×128; randomly flipping each image horizontally with probability 0.5, decoding each image into 32-bit floating-point raw pixel values, and normalizing the RGB channels.
6. The visual specific pedestrian tracking system based on ReID and Yolov5 double models according to claim 4, wherein the preprocessing module stores the preprocessed data with the following naming criterion: 000N_cNsN_000ABC_0N.jpeg or 000N_cNsN_000ABC_00.jpeg, wherein 000N is the label number of each person; cN denotes the Nth camera; sN denotes the Nth video clip; 000ABC denotes frame 000ABC of clip cNsN; 0N denotes the Nth detection box in frame cNsN_000ABC.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410161198.5A CN117710903B (en) | 2024-02-05 | 2024-02-05 | Visual specific pedestrian tracking method and system based on ReID and Yolov5 double models |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117710903A CN117710903A (en) | 2024-03-15 |
CN117710903B true CN117710903B (en) | 2024-05-03 |
Family
ID=90153828
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410161198.5A Active CN117710903B (en) | 2024-02-05 | 2024-02-05 | Visual specific pedestrian tracking method and system based on ReID and Yolov5 double models |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117710903B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111881777A (en) * | 2020-07-08 | 2020-11-03 | 泰康保险集团股份有限公司 | Video processing method and device |
CN113158891A (en) * | 2021-04-20 | 2021-07-23 | 杭州像素元科技有限公司 | Cross-camera pedestrian re-identification method based on global feature matching |
CN113408356A (en) * | 2021-05-21 | 2021-09-17 | 深圳市广电信义科技有限公司 | Pedestrian re-identification method, device and equipment based on deep learning and storage medium |
CN115497056A (en) * | 2022-11-21 | 2022-12-20 | 南京华苏科技有限公司 | Method for detecting lost articles in region based on deep learning |
CN115841649A (en) * | 2022-11-21 | 2023-03-24 | 哈尔滨工程大学 | Multi-scale people counting method for urban complex scene |
WO2023093241A1 (en) * | 2021-11-29 | 2023-06-01 | 中兴通讯股份有限公司 | Pedestrian re-identification method and apparatus, and storage medium |
CN117079309A (en) * | 2023-07-26 | 2023-11-17 | 上海云从企业发展有限公司 | ReID model training method, reID pedestrian recognition method, device and medium |
-
2024
- 2024-02-05 CN CN202410161198.5A patent/CN117710903B/en active Active
Non-Patent Citations (1)
Title |
---|
Design and Implementation of a Video Pedestrian Target Retrieval Platform Based on Improved YOLOv5s; Li Minglun; China Master's Theses Full-text Database, Information Science and Technology Series; 2023-05-12; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN117710903A (en) | 2024-03-15 |
Similar Documents
Publication | Title |
---|---|
CN109344787B (en) | Specific target tracking method based on face recognition and pedestrian re-recognition |
US20220415027A1 (en) | Method for re-recognizing object image based on multi-feature information capture and correlation analysis |
CN111832514B (en) | Unsupervised pedestrian re-identification method and unsupervised pedestrian re-identification device based on soft multiple labels |
CN112004111B (en) | News video information extraction method for global deep learning |
CN102549603B (en) | Relevance-based image selection |
CN108960184B (en) | Pedestrian re-identification method based on heterogeneous component deep neural network |
CN110717411A (en) | Pedestrian re-identification method based on deep layer feature fusion |
CN111582178B (en) | Vehicle weight recognition method and system based on multi-azimuth information and multi-branch neural network |
CN114067444A (en) | Face spoofing detection method and system based on meta-pseudo label and illumination invariant feature |
CN112580657B (en) | Self-learning character recognition method |
CN112836675B (en) | Unsupervised pedestrian re-identification method and system for generating pseudo tags based on clusters |
CN112215190A (en) | Illegal building detection method based on YOLOV4 model |
CN109635647B (en) | Multi-picture multi-face clustering method based on constraint condition |
CN112464775A (en) | Video target re-identification method based on multi-branch network |
CN114596548A (en) | Target detection method, target detection device, computer equipment and computer-readable storage medium |
CN114596546A (en) | Vehicle weight recognition method and device, computer and readable storage medium |
CN117710903B (en) | Visual specific pedestrian tracking method and system based on ReID and Yolov5 double models |
CN110765940B (en) | Target object statistical method and device |
CN112541453A (en) | Luggage weight recognition model training and luggage weight recognition method |
CN113435329B (en) | Unsupervised pedestrian re-identification method based on video track feature association learning |
CN115937862A (en) | End-to-end container number identification method and system |
CN112215189A (en) | Accurate detecting system for illegal building |
CN110880022A (en) | Labeling method, labeling device and storage medium |
CN116052220B (en) | Pedestrian re-identification method, device, equipment and medium |
CN117690031B (en) | SAM model-based small sample learning remote sensing image detection method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||