CN114863366A - Cross-camera multi-target tracking method based on prior information

Info

Publication number
CN114863366A
Authority
CN
China
Prior art keywords
clustering; cross-camera; target tracking; prior information
Prior art date
Legal status
Pending
Application number
CN202210596963.7A
Other languages
Chinese (zh)
Inventor
杨惠雯
林宇
赵宇迪
施侃
Current Assignee
Shanghai Shuchuan Data Technology Co., Ltd.
Original Assignee
Shanghai Shuchuan Data Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shanghai Shuchuan Data Technology Co., Ltd.
Priority to CN202210596963.7A
Publication of CN114863366A

Classifications

    • G06V 20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53 - Recognition of crowd images, e.g. recognition of crowd congestion
    • G06T 7/292 - Multi-camera tracking
    • G06T 7/80 - Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06V 10/22 - Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06V 10/42 - Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/761 - Proximity, similarity or dissimilarity measures
    • G06V 10/762 - Pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06T 2207/30196 - Human being; Person
    • G06T 2207/30244 - Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of computer vision, and in particular to a cross-camera multi-target tracking method based on prior information, which comprises the following steps: performing pedestrian detection with a YOLOX model and filtering out detection frames that do not meet the size and aspect-ratio requirements, to obtain qualified pedestrian detection frames; converting the detection frame coordinates from two-dimensional image coordinates to a world coordinate system, to obtain the actual spatial position of each pedestrian at a given moment; and extracting global features of the pedestrians in the detection frames with OSNet. The cross-camera multi-target tracking method provided by the invention obtains prior information from scene data alone, without introducing additional information; it reduces the over-merging that differences in appearance-feature distribution may cause, jointly considers the temporal, spatial and appearance-feature information of local trajectories, starts clustering from the most certain local trajectories while dynamically tightening the condition constraints, ensures high accuracy, and avoids the influence of overlapping camera views.

Description

Cross-camera multi-target tracking method based on prior information
Technical Field
The invention relates to the technical field of computer vision, in particular to a cross-camera multi-target tracking method based on prior information.
Background
With the rapid development of science and technology, people's pursuit of safety and an intelligent life is being met in more and more fields, such as video-based intelligent surveillance and intelligent transportation, and is receiving growing attention. Early single-camera target tracking could not track targets continuously because of limited fields of view, occlusion and similar problems; subsequent multi-camera tracking can solve these problems through cooperation among cameras and has gradually become a research hotspot. At present, cross-camera multi-target tracking is widely applied to in-store customer behavior analysis, urban traffic control, crowd behavior analysis and the like. Most cross-camera multi-target tracking methods comprise two stages: the first stage is single-camera local trajectory generation, in which each target is detected and tracked under a single camera to produce local trajectories; the second stage is cross-camera trajectory matching, in which the single-camera local trajectories from all cameras are matched to infer a complete and accurate cross-camera trajectory for each target.
At present, single-camera multi-target tracking algorithms fall into three main types. The first is a back-end tracking optimization pipeline that associates detections with the Hungarian algorithm and Kalman filtering; its advantage is that real-time tracking can be guaranteed, but it depends heavily on the detection algorithm and the features. The second is a multi-target tracking method that runs one single-target tracker per target on multiple threads; its advantage is a good tracking effect, but it is sensitive to target scale changes, consumes CPU resources heavily, and has poor real-time performance. The third is multi-target tracking based on deep learning, but most of these methods are still at the research stage and few have been deployed in practice.
For cross-camera multi-target tracking, the association strategy is currently mostly formulated as trajectory-to-trajectory matching. Some methods match the local trajectories of pairs of adjacent cameras, following the camera adjacency, until matching is complete; other methods iteratively match all trajectories in all cameras with greedy matching or hierarchical clustering; and still others use candidate pruning and adaptive attribute selection over the camera topology to reduce the search space and improve matching efficiency. In addition, some work seeks a global solution for trajectory matching with Bayesian formulations or graph models, obtaining the global trajectory of each target by maximizing the posterior probability or by finding the network flow from the source node to the sink node. In terms of information sources, the appearance features of a target are the key information for effectively associating the same target under different cameras; methods that obtain the target's real position for association tracking through the mapping between image coordinates and planar geographic coordinates are also increasingly common.
At present, cross-camera multi-target tracking mainly faces the following two problems. First, different targets appear in an uncertain number of cameras, so the number of local trajectories associated with each target differs and is unknown, and it is difficult to determine how many local trajectories should be matched and merged into one global trajectory. Second, when matching with target appearance features, a fixed similarity threshold is usually set; but the distribution of appearance features differs greatly across scenes, and tracking with a fixed threshold can cause over-merging between targets whose distributions differ widely.
Disclosure of Invention
The invention aims to provide a cross-camera multi-target tracking method based on prior information. The method obtains the prior information from scene data alone, without introducing additional information; it reduces the over-merging that differences in appearance-feature distribution may cause, jointly considers the temporal, spatial and appearance-feature information of local trajectories, starts clustering from the most certain local trajectories while dynamically tightening the condition constraints, ensures high accuracy, and avoids the influence of overlapping camera views. It thereby addresses the two problems of current cross-camera multi-target tracking: that different targets appear in an uncertain number of cameras, so the number of local trajectories associated with each target differs and is unknown; and that the distribution of target appearance features differs greatly across scenes when matching with appearance features.
In order to achieve the above purpose, the invention provides the following technical scheme: a cross-camera multi-target tracking method based on prior information, comprising the following steps:
S1, pedestrian detection: performing pedestrian detection with a YOLOX model and filtering out detection frames that do not meet the size and aspect-ratio requirements, to obtain qualified pedestrian detection frames;
S2, acquiring spatial position information: converting the detection frame coordinates from two-dimensional image coordinates to a world coordinate system, to obtain the actual spatial position of the pedestrian at a given moment;
S3, pedestrian feature extraction: extracting global features of the pedestrians in the detection frames with OSNet;
S4, single-camera trajectory tracking: tracking the pedestrians' trajectories with the multi-target tracking algorithm SORT;
S5, acquiring prior information: clustering the targets with the clustering algorithm InfoMap, then computing a probability from the number of trajectories in each class, and rescaling the probabilities into a fixed range;
S6, cross-camera trajectory clustering: merging classes in descending order of feature similarity until the clustering conditions are no longer met and the number of classes no longer changes.
Preferably, in step S1, when the detection frames are filtered, it is ensured that the human figure in each retained pedestrian detection frame is relatively complete.
Preferably, in step S2, the camera parameters used for the conversion are obtained by actual calibration and measurement.
Preferably, in step S3, a 1 × 512-dimensional feature is computed for each pedestrian detection frame.
Preferably, in step S4, the SORT algorithm associates detections across the frames of the video sequence by tracking from past frames to the current frame, to obtain single-camera trajectories.
Preferably, in step S5, the average feature of each single-camera trajectory is computed, InfoMap clustering is applied to the average features of all single-camera trajectories, and cosine distance is used as the similarity measure during clustering.
Preferably, in step S5, based on the clustering result, the total number Nums of single-camera trajectories is counted, the number Ni of single-camera trajectories in each class is computed, and the probability of each class is computed.
Preferably, in step S6, the class pairs are arranged by cosine distance from small to large during clustering, and the classes that satisfy the conditions are merged.
Preferably, in step S6, it is determined whether the number of classes changes before and after a clustering round; if it changes, the features are updated and feature clustering is performed again; if it does not change, clustering ends.
Preferably, in step S6, the average features of the trajectories merged into the same class are computed, and this repeats until the number of clusters no longer changes, at which point clustering ends.
Compared with the prior art, the invention has the following beneficial effects:
the cross-camera multi-target tracking method provided by the invention obtains prior information from scene data alone, without introducing additional information; it reduces the over-merging that differences in appearance-feature distribution may cause, jointly considers the temporal, spatial and appearance-feature information of local trajectories, starts clustering from the most certain local trajectories while dynamically tightening the condition constraints, ensures high accuracy, and avoids the influence of overlapping camera views. It thereby solves the problems that, in current cross-camera multi-target tracking, different targets appear in an uncertain number of cameras, so the number of local trajectories associated with each target differs and is unknown, and that the distribution of target appearance features differs greatly across scenes when matching with appearance features.
Drawings
FIG. 1 is a flow chart of a multi-target tracking method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A cross-camera multi-target tracking method based on prior information comprises the following steps:
S1, pedestrian detection: performing pedestrian detection with a YOLOX model and filtering out detection frames that do not meet the size and aspect-ratio requirements, to obtain qualified pedestrian detection frames;
S2, acquiring spatial position information: converting the detection frame coordinates from two-dimensional image coordinates to a world coordinate system, to obtain the actual spatial position of the pedestrian at a given moment;
S3, pedestrian feature extraction: extracting global features of the pedestrians in the detection frames with OSNet;
S4, single-camera trajectory tracking: tracking the pedestrians' trajectories with the multi-target tracking algorithm SORT;
S5, acquiring prior information: clustering the targets with the clustering algorithm InfoMap, then computing a probability from the number of trajectories in each class, and rescaling the probabilities into a fixed range;
S6, cross-camera trajectory clustering: merging classes in descending order of feature similarity until the clustering conditions are no longer met and the number of classes no longer changes.
Specifically, as shown in Fig. 1, an embodiment of the cross-camera multi-target tracking method based on prior information includes the following steps:
S101: pedestrian detection is performed on the video frames of the multiple cameras with the YOLOX model to obtain detection frame coordinates, and detection frames that do not meet the size and aspect-ratio requirements are filtered out, ensuring that the retained human figures are relatively complete, which benefits the feature extraction and feature comparison of the pedestrian re-identification model.
S102: and coordinates of the detection frame are converted into a world coordinate system from two-dimensional image coordinates to obtain the actual spatial position of the target at a certain moment, and the camera parameters used for conversion are obtained by actual calibration and measurement.
S103: the pedestrian re-identification model OSNet performs pedestrian feature extraction, and for each pedestrian detection frame, 1 × 512-dimensional features can be calculated;
s104: tracking the human target under the single mirror by using a multi-target tracking algorithm Sort to obtain a track under the single mirror;
s105: the method for obtaining the prior probability after clustering by using the clustering algorithm InfoMap comprises the following substeps:
s1051: calculating the average characteristic of each single-mirror track based on the pedestrian characteristic of the detection frame obtained in the step S103 and the single-mirror track obtained in the step S104;
s1052: using InfoMap clustering based on all single-mirror track average characteristics, using cosine distance for similarity measurement during clustering, and setting a threshold value to be 0.5;
s1053: based on the clustered result, counting to obtain the number Nums of all single mirror tracks and calculating the number Ni of the single mirror tracks of each class (i represents the ith class), calculating the probability of each class, and in order to avoid the situation that the dynamic threshold obtained by calculation is too small, rescaling Pi to a target interval, the method comprises the following substeps:
s10531: the probability Pi for each class is calculated as follows:
Figure BDA0003668310180000051
s10532: determining a threshold interval [ r1, r2], wherein 0< ═ r1< r2< ═ 1, and calculating to obtain a maximum probability maxP and a minimum probability minP;
s10533: the scaling factor k is calculated as follows:
Figure BDA0003668310180000061
s10544: the scaled probabilities PRi are calculated as follows:
PRi=r1+k*Pi
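A combined sketch of substeps S1052 and S1053, assuming the Python API of the open-source infomap package; the clustering threshold of 0.5 follows S1052, while the interval bounds r1 and r2 are illustrative assumptions.

```python
import numpy as np
from collections import Counter
from infomap import Infomap  # assumed API of the PyPI `infomap` package

def cluster_tracks(avg_feats, dist_thresh=0.5):
    """S1052: avg_feats is a (T, 512) array, one L2-normalized average
    feature per single-camera trajectory. Trajectories within cosine
    distance 0.5 are linked and InfoMap partitions the resulting graph.
    Returns {track_index: class_id}."""
    sim = avg_feats @ avg_feats.T                    # cosine similarity
    im = Infomap(silent=True)
    n = len(avg_feats)
    for i in range(n):
        for j in range(i + 1, n):
            if 1.0 - sim[i, j] < dist_thresh:        # cosine distance < 0.5
                im.add_link(i, j, float(sim[i, j]))  # weight by similarity
    im.run()
    return im.get_modules()                          # node id -> class id

def rescale_class_probs(modules, r1=0.5, r2=1.0):
    """S1053: computes Pi = Ni / Nums per class and min-max rescales it
    into [r1, r2]; the default bounds are assumed values, not taken from
    the patent."""
    nums = len(modules)                       # Nums: total trajectories
    counts = Counter(modules.values())        # Ni for each class i
    probs = {c: n / nums for c, n in counts.items()}
    min_p, max_p = min(probs.values()), max(probs.values())
    if max_p == min_p:                        # degenerate: all classes equal
        return {c: r2 for c in probs}
    k = (r2 - r1) / (max_p - min_p)           # scaling factor k of S10533
    return {c: r1 + k * (p - min_p) for c, p in probs.items()}  # PRi
```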
the YooloX pedestrian detection model and the OSNet pedestrian feature extraction model are known public deep learning models, the Sort is a known public tracking algorithm, and the InfoMap is a known public clustering algorithm.
S106: calculating cosine distance similarity between the single-mirror track average features;
s107: characteristic clustering, which is to arrange according to the cosine distance similarity from small to large, and combine the classes meeting the conditions, if a relatively loose threshold thresh is set, the classes to be combined during clustering need to meet the following sub-conditions:
s1071: judging that the two classes to be merged meet a dynamic threshold, wherein the step of calculating the dynamic threshold comprises the following substeps:
s10711: the probability sets of the two classes are represented as PC1 ═ { PRm1, …, PRn1} and PC2 ═ PRm2, …, PRn2}, and average probabilities Pm1 and Pm2 of the two classes are calculated respectively;
s10712: the threshold for merging two classes is as follows, requiring that the band merge two classes:
thresh dy =thresh*min(Pm1,Pm2)
s1072: the same camera and the same time characteristics do not exist in the category to be merged;
s1073: whether the categories to be merged have coincidence in spatial positions at the same time or not;
s1074: when the categories to be merged have time difference, the spatial movement speed is satisfied (for example, the normal walking speed of a person is 1-1.5 m/s);
s108: and judging whether the number of the front and the back clusters is changed, if so, executing S109, and if not, executing S107.
S109: updating the characteristics, calculating the average characteristics of the characteristics merged into the same category, repeating the processes from S106 to S108 until the clustering quantity is not changed any more, and finishing clustering.
In summary, the invention provides a cross-camera multi-target tracking method based on prior information, aiming at the poor cross-scene robustness caused by setting a fixed threshold when existing methods use target appearance features. Taking pedestrian tracking as an example, with the videos of multiple cameras over a time period as input, the method first processes each single camera: pedestrians in the videos are detected with the YOLOX model, tracked with the multi-target tracking algorithm SORT, and features of the detected pedestrians are extracted with the OSNet model. Next, all single-camera trajectories are clustered with the InfoMap algorithm, the probability of each class is computed, and the probabilities are rescaled into a fixed range. Finally, the single-camera trajectories are sorted, and those satisfying the dynamic threshold are clustered and merged in batches until the conditions are no longer met, at which point clustering stops.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A cross-camera multi-target tracking method based on prior information, characterized in that the method comprises the following steps:
S1, pedestrian detection: performing pedestrian detection with a YOLOX model and filtering out detection frames that do not meet the size and aspect-ratio requirements, to obtain qualified pedestrian detection frames;
S2, acquiring spatial position information: converting the detection frame coordinates from two-dimensional image coordinates to a world coordinate system, to obtain the actual spatial position of the pedestrian at a given moment;
S3, pedestrian feature extraction: extracting global features of the pedestrians in the detection frames with OSNet;
S4, single-camera trajectory tracking: tracking the pedestrians' trajectories with the multi-target tracking algorithm SORT;
S5, acquiring prior information: clustering the targets with the clustering algorithm InfoMap, then computing a probability from the number of trajectories in each class, and rescaling the probabilities into a fixed range;
S6, cross-camera trajectory clustering: merging classes in descending order of feature similarity until the clustering conditions are no longer met and the number of classes no longer changes.
2. The cross-camera multi-target tracking method based on prior information according to claim 1, characterized in that: in step S1, when the detection frames are filtered, it is ensured that the human figure in each retained pedestrian detection frame is relatively complete.
3. The cross-camera multi-target tracking method based on the prior information as claimed in claim 1, characterized in that: in step S2, the camera parameters used for conversion are obtained by actual calibration and measurement.
4. The cross-camera multi-target tracking method based on the prior information as claimed in claim 1, characterized in that: in step S3, a feature of 1 × 512 dimensions is calculated for each pedestrian detection frame.
5. The cross-camera multi-target tracking method based on prior information according to claim 1, characterized in that: in step S4, the SORT algorithm associates detections across the frames of the video sequence by tracking from past frames to the current frame, to obtain single-camera trajectories.
6. The cross-camera multi-target tracking method based on prior information according to claim 1, characterized in that: in step S5, the average feature of each single-camera trajectory is computed, InfoMap clustering is applied to the average features of all single-camera trajectories, and cosine distance is used as the similarity measure during clustering.
7. The cross-camera multi-target tracking method based on prior information according to claim 1, characterized in that: in step S5, based on the clustering result, the total number Nums of single-camera trajectories is counted, the number Ni of single-camera trajectories in each class is computed, and the probability of each class is computed.
8. The cross-camera multi-target tracking method based on prior information according to claim 1, characterized in that: in step S6, the class pairs are arranged by cosine distance from small to large during clustering, and the classes that satisfy the conditions are merged.
9. The cross-camera multi-target tracking method based on prior information according to claim 1, characterized in that: in step S6, it is determined whether the number of classes changes before and after a clustering round; if it changes, the features are updated and feature clustering is performed again; if it does not change, clustering ends.
10. The cross-camera multi-target tracking method based on prior information according to claim 1, characterized in that: in step S6, the average features of the trajectories merged into the same class are computed, and this repeats until the number of clusters no longer changes, at which point clustering ends.
CN202210596963.7A 2022-05-30 2022-05-30 Cross-camera multi-target tracking method based on prior information Pending CN114863366A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210596963.7A CN114863366A (en) 2022-05-30 2022-05-30 Cross-camera multi-target tracking method based on prior information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210596963.7A CN114863366A (en) 2022-05-30 2022-05-30 Cross-camera multi-target tracking method based on prior information

Publications (1)

Publication Number Publication Date
CN114863366A true CN114863366A (en) 2022-08-05

Family

ID=82642148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210596963.7A Pending CN114863366A (en) 2022-05-30 2022-05-30 Cross-camera multi-target tracking method based on prior information

Country Status (1)

Country Link
CN (1) CN114863366A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311690A (en) * 2022-10-08 2022-11-08 广州英码信息科技有限公司 End-to-end pedestrian structural information and dependency relationship detection method thereof
CN115311690B (en) * 2022-10-08 2022-12-23 广州英码信息科技有限公司 End-to-end pedestrian structural information and dependency relationship detection method thereof
CN117173794A (en) * 2023-11-03 2023-12-05 广州英码信息科技有限公司 Pedestrian re-identification method suitable for edge equipment deployment
CN117173794B (en) * 2023-11-03 2024-03-08 广州英码信息科技有限公司 Pedestrian re-identification method suitable for edge equipment deployment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination