CN113409356A - Similarity calculation method and multi-target tracking method - Google Patents
- Publication number
- Publication number: CN113409356A (application number CN202110695292.5A)
- Authority
- CN
- China
- Prior art keywords
- target
- frame
- similarity
- targets
- tracking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/248—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/045—Combinations of networks
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- G06T2207/10016—Video; image sequence
- G06T2207/20021—Dividing image into blocks, subimages or windows
- G06T2207/20081—Training; learning
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention discloses a similarity calculation method and a multi-target tracking method. For each target in each video frame, the neighbors of the target are computed; a vertex set is constructed using the appearance features of the target and its neighbors, and a directed edge set is computed using the interrelationship between targets, thereby constructing a directed graph. For adjacent video frames, matching is performed between the directed graphs of the targets in the two frames to obtain a similarity calculation result. The invention simultaneously utilizes the appearance features of targets and the relative position features between targets, so that the topological structure between targets is encoded into the directed graph. The invention improves the accuracy and precision of multi-target tracking; it can retrieve a lost target when that target is severely occluded by other targets; and it tracks more targets while reducing the number of lost targets.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a similarity calculation method and a multi-target tracking method.
Background
Multi-target tracking means that, given a video, the algorithm outputs position boxes for the targets of interest in the video. It should be noted that the number of targets in the video is not fixed. During tracking, each target is also assigned unique identity information (a number, denoted by ID). Multi-target tracking has wide applications, such as intelligent surveillance and autonomous driving. With the rapid development of object detection, detection-based multi-target tracking methods have gradually become mainstream. In such methods, the multi-target tracking problem is mainly solved by data association.
In general, a robust similarity model is the key to solving the data association. In most existing methods, the similarity between targets is calculated based on the features of the individual targets; that is, the similarity model does not consider the interrelationship between targets. As the feature representation capability of targets has grown stronger with the development of deep learning, similarity calculated from individual target features has become reasonably robust. However, this calculation method still has limitations in complex scenarios. For example, when the tracked targets belong to the same category (e.g., pedestrian tracking, vehicle tracking), the targets have a certain similarity in appearance, and frequent occlusion between targets is inevitable. Fig. 1 demonstrates such a complex scenario in pedestrian tracking. Part (a) of fig. 1 shows two adjacent frames; the clothes of different pedestrians are relatively similar in this scene, and there is some occlusion between pedestrians. Part (b) of fig. 1 shows the similarity score calculated from individual target features; the basic flow is as follows: first, individual features of a single target are extracted (Individual Representation), and then similarity calculation is performed (Similarity Measure). It can be seen that the calculated similarity score is relatively low when a pedestrian is occluded, and relatively high when different pedestrians are dressed similarly. Therefore, in such a complex scenario, the similarity score calculated from individual target features is not reliable enough.
Appearance features of targets are widely used in target tracking. Earlier work utilized manually designed appearance features for multi-target tracking. For example, Yamaguchi et al. used raw pixel templates in the article "Who are you with and where are you going? CVPR 2011", and Izadinia et al. used color histograms and Histograms of Oriented Gradients (HOG) in the article "(MP)2T: Multiple people multiple parts tracker. ECCV 2012". With the development of deep learning in recent years, appearance features extracted by a Convolutional Neural Network (CNN) have been widely applied in the field of multi-target tracking.
Motion information of targets is also widely applied in the field of multi-target tracking. Methods using motion information are basically based on the assumption that the motion of targets is smooth and slow. Milan et al. designed a linear motion model in the article "Continuous energy minimization for multi-target tracking. TPAMI 2013", and Yang et al. designed a non-linear motion model in the article "Multi-target tracking by online learning of non-linear motion patterns and robust appearance models. CVPR 2012". However, the motion of a target in a video is determined not only by the target itself but also by the shooting device. During shooting, the device inevitably has a certain jitter, which is generally random and unpredictable. Therefore, it is difficult to solve the camera shake problem by relying on a motion model alone.
The methods above design and utilize features of individual targets and do not exploit the interrelationship between targets. Some recent work has attempted multi-target tracking using the interrelationship between targets. Sadeghian et al. designed an Occupancy Map by partitioning a picture into a grid of fixed size in the article "Tracking the untrackable: Learning to track multiple cues with long-term dependencies. ICCV 2017": when a target exists in a certain grid cell, the corresponding value in the occupancy map is set to 1, otherwise to 0. This processing only roughly records the distribution of targets; a grid cell with value 1 cannot distinguish different targets or the number of targets. Xu et al., in the article "Spatial-temporal relation networks for multi-object tracking. ICCV 2019", extract the interrelationship between targets using a Relation Network. Specifically, the mutual position relationship between targets is encoded into a weight, and this weight is then used to fuse the appearance features of the other targets in the current frame; the weights each target uses when fusing the appearance features of the other targets are different. However, this has limited interpretability and ignores the topology between targets.
Disclosure of Invention
The invention aims to provide a similarity calculation method and a multi-target tracking method, which can improve the robustness of a similarity model in multi-target tracking and ensure the tracking effect of the multi-target tracking in a complex scene.
The purpose of the invention is realized by the following technical scheme: a similarity calculation method for multi-target tracking, comprising:
for each target in each video frame, calculating the neighbor of the target, constructing a vertex set by using the appearance characteristics of the target and the neighbor of the target, and calculating a directed edge set by using the correlation between the targets, thereby constructing a directed graph.
And for adjacent video frames, performing matching calculation by using the directed graphs of all targets in the two video frames to obtain a similarity calculation result.
Further, the target set in the t-th frame is expressed as O^t = {o_i^t = (b_i^t, p_i^t)}_{i=1}^{I_t}, where the i-th target is represented as o_i^t; b_i^t represents the position box of the i-th target, whose four elements are the horizontal and vertical coordinates of its upper-left corner and the width and height of the position box; p_i^t represents the picture block truncated from the t-th frame according to the position box; and I_t indicates the number of targets in the t-th frame.
Further, the K neighbors of each target are obtained according to the distances between targets, where K is the total number of neighbors. For the t-th frame, with the i-th target o_i^t as the anchor, the target o_i^t is regarded as its own 0-th neighbor, and the target together with its neighbors forms the set N_i^t = {o_{i_k}^t}_{k=0}^{K}, where o_{i_k}^t is the k-th neighbor of o_i^t.
Further, for the i-th target o_i^t in the t-th frame, the constructed directed graph is represented as G_i^t = (V_i^t, E_i^t), where the vertex set V_i^t is defined as:

V_i^t = { v_{i_k}^t = φ_ACNN(p_{i_k}^t) | k = 0, 1, ..., K }

where v_{i_k}^t represents the appearance feature of o_{i_k}^t and φ_ACNN(·) represents the forward function of the convolutional neural network used to extract appearance features.
For the directed edge set E_i^t, first let r_{i,k}^t represent the relative position vector between the anchor o_i^t and its k-th neighbor o_{i_k}^t:

r_{i,k}^t = φ_RP(b_i^t, b_{i_k}^t)

where the box coordinates are normalized by w_t and h_t, the width and height of the t-th frame, and φ_RP(·) is a function that calculates the relative position between targets based on their position boxes.
A relative position encoder is used to transform the relative position vector r_{i,k}^t into e_{i,k}^t, thereby obtaining the directed edge set:

E_i^t = { e_{i,k}^t = φ_RPE(r_{i,k}^t) | k = 0, 1, ..., K }

where φ_RPE(·) is the relative position encoder.
Further, hard matching using the directed graphs of targets in two video frames comprises the following steps:

For adjacent video frames, the (t-1)-th frame and the t-th frame, given the directed graphs G_i^{t-1} and G_j^t of two targets, first calculate a similarity matrix S^{i,j} of size (K+1) × (K+1); the element in the k-th row and k'-th column of the matrix is calculated as:

s_{k,k'} = φ_BC([ (v_{i_k}^{t-1} ⊖ v_{j_{k'}}^t)^2, (e_{i,k}^{t-1} ⊖ e_{j,k'}^t)^2 ])

where ⊖ represents element-wise subtraction between feature vectors, (·)^2 represents squaring each element of a vector, [·,·] means that two vectors are spliced together, and φ_BC(·) represents the forward function of the binary classifier.

Finally, the hard-matching similarity is obtained as the average of the diagonal elements:

s_hard^{i,j} = (1/(K+1)) Σ_{k=0}^{K} s_{k,k}
further, soft matching is carried out by utilizing directed graphs of objects in two video frames, and the soft matching method comprises the following steps:
on the basis of hard matching, firstly performing proximity alignment, and then calculating the similarity, which is expressed as:
in the formula (I), the compound is shown in the specification,is a similarity matrix Si,jRemoving the first row and the first column to obtain a matrix; phi is aLA(. cndot.) is a linear distribution function for completing task distribution and returning the maximum total similarity sum according to the input similarity matrix.
The multi-target tracking method applies the similarity calculation method to the existing multi-target tracking method based on data association to replace a similarity model in the multi-target tracking method.
Further, comprising: retrieving a target lost in the current frame by using information from the previous frame, as follows:

For the i-th target o_i^{t-1} in the (t-1)-th frame, if it is lost in the t-th frame, the relative position r_{i,k}^{t-1} between o_i^{t-1} and its k-th neighbor o_{i_k}^{t-1} is used, together with the position box of that neighbor's corresponding target in the t-th frame, to estimate a position box for the i-th target in the t-th frame:

b̂_{i,k}^t = φ_RP^{-1}(r_{i,k}^{t-1}, b_{i_k}^t)

where φ_RP^{-1}(·) represents the inverse function of φ_RP, φ_RP(·) is the function that calculates the relative position between targets based on their position boxes, and o_{i_k}^t represents the target in the t-th frame corresponding to o_{i_k}^{t-1}.

All K neighbors of the i-th target estimate a position box for it in the t-th frame in this manner, and the final position box of the i-th target in the t-th frame is calculated by averaging:

b̂_i^t = (1/K) Σ_{k=1}^{K} b̂_{i,k}^t
Further, after the final position box b̂_i^t of the i-th target in the t-th frame is obtained, several candidate boxes are sampled from a Gaussian distribution centered on b̂_i^t.

For any sampled candidate box b_c, o_c = (b_c, p_c) represents a candidate of o_i^{t-1} in the t-th frame, where p_c denotes the picture block truncated from the t-th frame according to the position box b_c; the directed graph G_c is then constructed, and the similarity between G_i^{t-1} and G_c is obtained.

Among all candidates, the one with the highest similarity score is selected; if that score is greater than a set threshold, the highest-scoring candidate is taken as the tracking result of o_i^{t-1} in the t-th frame.
The invention has the following beneficial effects: it simultaneously utilizes the appearance features of targets and the relative position features between targets, so that the topological structure between targets is encoded into the directed graph. The invention improves the accuracy and precision of multi-target tracking; it can retrieve a lost target when that target is severely occluded by other targets; and it tracks more targets while reducing the number of lost targets.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 is a diagram illustrating the effect of the prior art in a complex application scenario;
fig. 2 is a schematic diagram illustrating an effect of the similarity calculation method according to the embodiment of the present invention;
FIG. 3 is a schematic diagram of a constructed directed graph provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a relative position encoder according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a second classifier according to an embodiment of the present invention;
fig. 6 is a schematic diagram illustrating a change of a directed graph in a missing detection situation according to an embodiment of the present invention;
fig. 7 is a schematic diagram of retrieving a lost target based on a directed graph according to an embodiment of the present invention;
FIG. 8 illustrates tracking performance at different K values provided by embodiments of the present invention;
fig. 9 is a schematic diagram of a visual tracking result provided in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The similarity calculation method for multi-target tracking provided by the invention can serve as a similarity model that improves robustness in multi-target tracking. Specifically, the similarity model is a graph similarity model; it is general-purpose and can replace the similarity model in existing multi-target tracking methods.
As shown in fig. 2, which illustrates the main principle of the similarity calculation method for multi-target tracking (taking the scenario shown in part (a) of fig. 1 as an example), the method mainly includes:
1. for each target in each video frame, calculating the neighbor of the target, then using the appearance characteristics of the target and the neighbor to construct a vertex set, and using the correlation between the targets to calculate a directed edge set, thereby constructing a directed Graph, namely 'Graph Representation' in fig. 2.
2. For adjacent video frames, Matching calculation is performed by using the directed graphs of the targets in the two video frames to obtain a similarity calculation result, namely 'Graph Matching' in fig. 2.
For convenience of understanding, the following detailed description is made on the above scheme of the present invention, and the scheme is mainly described by two parts, wherein the first part mainly describes data association in multi-target tracking; the second section mainly introduces the principle of the graph similarity model.
Firstly, data association.
The target set in the t-th frame is expressed as O^t = {o_i^t = (b_i^t, p_i^t)}_{i=1}^{I_t}, where the i-th target is represented as o_i^t; b_i^t represents the position box of the i-th target, whose four elements are the horizontal and vertical coordinates of its upper-left corner and the width and height of the position box; p_i^t represents the picture block truncated from the t-th frame according to the position box of the i-th target; and I_t indicates the number of targets in the t-th frame.

Data association is performed on two adjacent frames, the (t-1)-th frame and the t-th frame. When solving the data association, a cost matrix of size I_{t-1} × I_t needs to be provided; the element m_{i,j} in the i-th row and j-th column of the cost matrix represents the cost of matching the targets o_i^{t-1} and o_j^t, calculated as follows:

m_{i,j} = φ_CI(o_i^{t-1}, o_j^t)

where φ_CI(·,·) denotes a function that calculates the cost based on individual target features.
Most existing methods mainly learn better individual target features or design a better cost function φ_CI to improve the robustness of the similarity model, but this does not take the interrelationship between targets into account.
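As a concrete illustration of the data-association step, the following sketch (hypothetical, not taken from the patent) builds a cost matrix from a pairwise cost function standing in for φ_CI and solves the assignment by brute force; a real tracker would use the Hungarian algorithm instead:

```python
from itertools import permutations

def build_cost_matrix(prev_targets, curr_targets, cost_fn):
    """Cost matrix M: M[i][j] is the cost of matching target i in
    frame t-1 to target j in frame t (cost_fn stands in for phi_CI)."""
    return [[cost_fn(p, c) for c in curr_targets] for p in prev_targets]

def min_cost_assignment(cost):
    """Brute-force linear assignment (a stand-in for the Hungarian
    algorithm; assumes rows <= columns and small target counts).
    Returns (total_cost, mapping row -> column)."""
    n_rows, n_cols = len(cost), len(cost[0])
    best_total, best_map = float("inf"), None
    for perm in permutations(range(n_cols), n_rows):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_total, best_map = total, dict(enumerate(perm))
    return best_total, best_map

# Toy example: cost = squared distance between 1-D "appearance" features.
prev = [0.1, 0.9]
curr = [0.85, 0.15]
M = build_cost_matrix(prev, curr, lambda a, b: (a - b) ** 2)
total, mapping = min_cost_assignment(M)
print(mapping)  # {0: 1, 1: 0} -- target 0 matches detection 1
```

The brute-force search is factorial in the number of detections; it is only meant to show what the assignment returns.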
Part (a) of fig. 3 shows the target detection results in two adjacent frames; the numbers in the upper-left corner of the targets' position boxes distinguish the different detections. Taking the 1st target (i.e. the leftmost pedestrian) as an example, it can be seen that o_1^t is severely occluded and differs considerably in appearance from o_1^{t-1}.
In order to utilize the interrelationship between targets, the invention designs a graph similarity model; the graph-based cost is calculated as follows:

m_{i,j} = φ_GI(G_i^{t-1}, G_j^t)

where G_i^{t-1} is the directed graph created for target o_i^{t-1} (implementation details are given below) and φ_GI(·,·) represents a function that computes the cost based on the feature representations of the two graphs. The interrelationship between targets is embedded in the feature representation of the graph.
Secondly, a graph similarity model.
This section is presented from three aspects: obtaining target neighbors, constructing the directed graph, and graph matching.
1. Acquiring target neighbors.
To create the directed graph G_i^t, the K neighbors of o_i^t in the t-th frame must first be acquired (K is the total number of neighbors; its value can be set freely). A target and its neighbors must be in the same frame. There are many ways to obtain neighbors, for which conventional techniques may be consulted; the invention is not limited in this respect.

With some metric, the distance between targets can be calculated. "Distance" here is a general expression that includes, but is not limited to, the Euclidean distance between the center points of the target position boxes; in the embodiment of the invention, neighbors are obtained using this Euclidean distance. The ordered set N_i^t = {o_{i_k}^t}_{k=0}^{K} represents the target o_i^t and its neighbors, where o_{i_k}^t is the k-th neighbor of o_i^t. In addition, o_i^t is defined as the anchor. To simplify notation, o_i^t is regarded as its own 0-th neighbor, so o_{i_0}^t = o_i^t. Part (a) of fig. 3 gives an example with K = 2.
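The neighbor-acquisition step above can be sketched as follows (an illustrative implementation with hypothetical box values; any other distance metric could be substituted):

```python
import math

def box_center(box):
    # box = (x, y, w, h): top-left corner plus width and height
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def k_neighbors(boxes, i, k):
    """Indices of the K nearest neighbors of target i (Euclidean
    distance between box centers), with target i itself as the 0-th
    neighbor, mirroring the ordered set N_i in the text."""
    cx, cy = box_center(boxes[i])
    others = [(math.hypot(box_center(b)[0] - cx, box_center(b)[1] - cy), j)
              for j, b in enumerate(boxes) if j != i]
    others.sort()
    return [i] + [j for _, j in others[:k]]

boxes = [(0, 0, 10, 20), (12, 0, 10, 20), (50, 0, 10, 20), (13, 2, 10, 20)]
print(k_neighbors(boxes, 0, 2))  # [0, 1, 3]
```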
2. Constructing the directed graph.
For the i-th target o_i^t in the t-th frame, the constructed directed graph is represented as G_i^t = (V_i^t, E_i^t). The directed graph G_i^t is built from N_i^t and consists of K+1 vertices and K+1 directed edges. Parts (b), (c) and (d) of fig. 3 respectively show the directed graphs of different targets in two adjacent frames; although the nodes of the directed graphs may be the same, their edges are different. The vertex set is defined as:

V_i^t = { v_{i_k}^t = φ_ACNN(p_{i_k}^t) | k = 0, 1, ..., K }

where v_{i_k}^t represents the appearance feature of o_{i_k}^t, p_{i_k}^t represents the picture block truncated from the t-th frame according to the position box of o_{i_k}^t, and φ_ACNN(·) represents the forward function of the convolutional neural network used to extract appearance features; the convolutional neural network may be implemented in a conventional manner.
In order to exploit the interrelationship between targets, the mutual position between targets is first defined. Let r_{i,k}^t represent the relative position vector between the anchor o_i^t and its k-th neighbor o_{i_k}^t:

r_{i,k}^t = φ_RP(b_i^t, b_{i_k}^t)

where the elements (x, y, w, h) of each position box b are the horizontal and vertical coordinates of its upper-left corner and its width and height, w_t and h_t are the width and height of the t-th frame (used to normalize the coordinates), and φ_RP(·) is a function that calculates the relative position between targets based on their position boxes.
using relative position encoders to align relative position vectorsIs transformed to obtainThereby obtaining a set of directed edges
In the above formula, the first and second carbon atoms are,to representElement of (5), phiRPE(. cndot.) is a relative position encoder.
Illustratively, the relative position vector r_{i,k}^t can be an 8-dimensional vector, and this 8-dimensional relative position vector can be transformed into a high-dimensional relative position vector using the method designed by Vaswani et al. in the article "Attention is all you need. NIPS 2017". Fig. 4 schematically shows the structure of the relative position encoder, which mainly consists of an FC (Fully Connected) layer and also utilizes a Batch Normalization (BN) layer and a ReLU activation function.
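A minimal sketch of such an encoder, assuming for illustration that the BN layer is folded into the FC weights at inference time and using random, hypothetical weights and dimensions:

```python
import random

def fc_relu(x, weights, bias):
    """One fully connected layer followed by ReLU. At inference a
    BatchNorm layer can be folded into (weights, bias), so this single
    affine map plus ReLU sketches the FC-BN-ReLU encoder."""
    out = []
    for w_row, b in zip(weights, bias):
        v = sum(wi * xi for wi, xi in zip(w_row, x)) + b
        out.append(max(0.0, v))  # ReLU
    return out

random.seed(0)
in_dim, out_dim = 8, 16          # 8-D relative position -> 16-D (illustrative sizes)
W = [[random.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
b = [0.0] * out_dim

rel_pos = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 0.02, 0.07]  # hypothetical r vector
encoded = fc_relu(rel_pos, W, b)
print(len(encoded))  # 16
```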
3. Graph matching.
For adjacent video frames, i.e. the (t-1)-th frame and the t-th frame, given the directed graphs G_i^{t-1} and G_j^t of two targets, a similarity matrix S^{i,j} of size (K+1) × (K+1) is first calculated; the element s_{k,k'} in the k-th row and k'-th column of the matrix is calculated as:

s_{k,k'} = φ_BC([ (v_{i_k}^{t-1} ⊖ v_{j_{k'}}^t)^2, (e_{i,k}^{t-1} ⊖ e_{j,k'}^t)^2 ])

where ⊖ represents element-wise subtraction between feature vectors, (·)^2 squares each element of a vector, [·,·] represents concatenating two vectors, and φ_BC(·) denotes the forward function of the Binary Classifier (BC); fig. 5 schematically shows the structure of the binary classifier. The similarity is then taken as the average of the diagonal elements:

s_hard^{i,j} = (1/(K+1)) Σ_{k=0}^{K} s_{k,k}

This matching mode is hard matching. In practical applications the detection results are not perfect: missed detections and false alarms exist, so the similarity score obtained by this calculation is not highly robust.
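Hard matching can be sketched as follows (hypothetical similarity values; the high off-diagonal scores mimic a swapped neighbor order):

```python
def hard_match_score(S):
    """Hard matching: average the diagonal of the (K+1)x(K+1)
    similarity matrix, i.e. compare the k-th neighbor of one graph
    only against the k-th neighbor of the other."""
    n = len(S)
    return sum(S[k][k] for k in range(n)) / n

# Hypothetical 3x3 similarity matrix (anchor + K=2 neighbors).
S = [[0.9, 0.2, 0.1],
     [0.3, 0.1, 0.8],   # neighbor order swapped: high scores sit off-diagonal
     [0.2, 0.7, 0.2]]
print(round(hard_match_score(S), 2))  # 0.4 -- low, because neighbors are misaligned
```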
As shown in fig. 6, when a target is missed in the t-th frame, the correspondingly constructed directed graphs in the two frames also change. The similarity score obtained by directly performing a hard match is then not reliable, because the order of the neighbors changes, i.e. the neighbors are no longer aligned.
In order to solve the problem of unaligned neighbors, a soft matching scheme is further proposed: on the basis of hard matching, neighbor alignment is performed first, and the similarity is then calculated, expressed as:

s_soft^{i,j} = ( s_{0,0} + φ_LA(S'^{i,j}) ) / (K+1)

where S'^{i,j} is the matrix obtained by removing the first row and first column of the similarity matrix S^{i,j}, and φ_LA(·) is a linear assignment function, obtained by modifying the Hungarian algorithm, that completes the task assignment and returns the maximum total similarity according to the input similarity matrix.

Compared with hard matching, soft matching can align the K neighbors of the anchor. It should be noted, however, that the similarity score obtained by soft matching is never less than that obtained by hard matching, i.e. s_soft^{i,j} ≥ s_hard^{i,j} always holds. Soft matching therefore has a positive effect when the two anchors of the two graphs are the same target (positive samples) and a negative effect when the anchors are different targets (negative samples). Nevertheless, since both the appearance features of targets and the relative position features between targets are encoded into the directed graph, the feature representation capability of the directed graph is strong, and the negative influence of soft matching on negative samples is basically negligible. Finally, the aforementioned cost m_{i,j} = φ_GI(G_i^{t-1}, G_j^t) can be rewritten directly in terms of the soft-matching similarity s_soft^{i,j}.
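A sketch of soft matching under the same kind of hypothetical similarity matrix, with the linear assignment φ_LA implemented by brute force rather than the modified Hungarian algorithm the text describes:

```python
from itertools import permutations

def max_sum_assignment(S):
    """Stand-in for phi_LA: linear assignment maximizing the total
    similarity (brute force; fine for small K)."""
    n = len(S)
    return max(sum(S[k][p[k]] for k in range(n))
               for p in permutations(range(n)))

def soft_match_score(S):
    """Soft matching: keep the anchor-anchor score S[0][0], align the
    K neighbors by linear assignment on the submatrix with the first
    row and column removed, then average over the K+1 entries."""
    sub = [row[1:] for row in S[1:]]
    return (S[0][0] + max_sum_assignment(sub)) / len(S)

S = [[0.9, 0.2, 0.1],
     [0.3, 0.1, 0.8],
     [0.2, 0.7, 0.2]]
hard = sum(S[k][k] for k in range(3)) / 3
print(round(hard, 2), round(soft_match_score(S), 2))  # 0.4 0.8
```

As the example shows, realigning the swapped neighbors recovers a high score, and soft ≥ hard always holds since the identity assignment is one of the candidates.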
preferably, the embodiment of the invention also provides a multi-target tracking method capable of retrieving the lost target.
In the field of multi-target tracking, one target may be occluded by other targets. When the occlusion is severe, it is difficult for the detector to detect the occluded object, and as shown in fig. 6, the rightmost object is lost at the t-th frame. Because the graph similarity model designed by the invention utilizes the topological structure between the targets, the lost targets can be found back by utilizing the graph similarity model in the invention, as shown in fig. 7.
For the i-th target o_i^{t-1} in the (t-1)-th frame, if it is lost in the t-th frame, the relative position r_{i,k}^{t-1} between o_i^{t-1} and its k-th neighbor o_{i_k}^{t-1} is used, together with the position box of that neighbor's corresponding target in the t-th frame, to estimate a position box for the i-th target in the t-th frame:

b̂_{i,k}^t = φ_RP^{-1}(r_{i,k}^{t-1}, b_{i_k}^t)

where φ_RP^{-1}(·) represents the inverse function of φ_RP, φ_RP(·) is the function that calculates the relative position between targets based on their position boxes, and o_{i_k}^t represents the target in the t-th frame corresponding to o_{i_k}^{t-1}.
All K neighbors of the i-th target estimate a position box for it in the t-th frame in this manner, and the final position box b̂_i^t of the i-th target in the t-th frame is calculated by averaging:

b̂_i^t = (1/K) Σ_{k=1}^{K} b̂_{i,k}^t
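A sketch of this retrieval step, assuming for illustration a simple invertible φ_RP (center offset plus the anchor's size); the patent's actual relative-position function is not specified here:

```python
def rel_pos(anchor_box, nbr_box):
    """Hypothetical invertible phi_RP: offset from neighbor to anchor
    (top-left corners) plus the anchor's own width and height."""
    ax, ay, aw, ah = anchor_box
    nx, ny, nw, nh = nbr_box
    return (ax - nx, ay - ny, aw, ah)

def rel_pos_inv(r, nbr_box_t):
    """phi_RP inverse: reconstruct the anchor box in frame t from the
    stored relative position and the neighbor's box in frame t."""
    dx, dy, w, h = r
    nx, ny, _, _ = nbr_box_t
    return (nx + dx, ny + dy, w, h)

def estimate_lost_box(rels, nbr_boxes_t):
    """Average the per-neighbor estimates to get the final box."""
    ests = [rel_pos_inv(r, nb) for r, nb in zip(rels, nbr_boxes_t)]
    n = len(ests)
    return tuple(sum(e[d] for e in ests) / n for d in range(4))

# Frame t-1: lost target and two neighbors; frame t: neighbors moved right by 5.
lost_tm1 = (10, 10, 4, 8)
nbrs_tm1 = [(20, 10, 4, 8), (30, 12, 4, 8)]
nbrs_t   = [(25, 10, 4, 8), (35, 12, 4, 8)]
rels = [rel_pos(lost_tm1, nb) for nb in nbrs_tm1]
print(estimate_lost_box(rels, nbrs_t))  # (15.0, 10.0, 4.0, 8.0)
```

Because both neighbors translated rigidly, every per-neighbor estimate agrees and the average lands exactly where the lost target should be.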
Further, after the final position box b̂_i^t of the i-th target in the t-th frame is obtained, several candidate boxes are sampled from a Gaussian distribution centered on b̂_i^t. For any sampled candidate box b_c, o_c = (b_c, p_c) represents a candidate of o_i^{t-1} in the t-th frame, where p_c denotes the picture block truncated from the t-th frame according to the position box b_c; the directed graph G_c is constructed, and the similarity between G_i^{t-1} and G_c is then obtained. Among all candidates, the one with the highest similarity score is selected; if that score is greater than a set threshold, the highest-scoring candidate is taken as the tracking result of o_i^{t-1} in the t-th frame.
In the above-mentioned solution of the embodiment of the present invention, a feature representation (i.e. directed graph) of a graph is designed, and the feature representation not only utilizes the features of target individuals, but also utilizes the interrelation between targets. This correlation is represented by a directed graph, which is also in fact a topology between the targets; a characteristic matching mode of the graph is also designed, and a more robust similarity score can be obtained through a reasonable matching mode.
On the other hand, in order to explain the effect of the above scheme of the embodiment of the invention, the graph similarity model is applied to the existing multi-target tracking method based on data association, the similarity model therein is replaced, and the validity of the graph similarity model is verified through experiments.
Experiments were performed on MOTChallenge (https://motchallenge.net/) to analyze the merits and positive effects of the graph similarity model. The data sets used include MOT16 and MOT17, with the following evaluation indices:
1) Multi-Object Tracking Accuracy (MOTA): the higher, the better;
2) Multi-Object Tracking Precision (MOTP): the higher, the better;
3) ID F1 score (IDF1), the frequency with which the same target is assigned the same ID: the higher, the better;
4) the number of Mostly Tracked targets (MT): the higher, the better;
5) the number of Mostly Lost targets (ML): the lower, the better;
6) the number of identity switches (IDS), i.e., times a target's ID changes: the lower, the better;
7) the number of track fragmentations (Frag), i.e., discontinuous target tracks: the lower, the better;
8) the number of missed targets (FN, false negatives): the lower, the better;
9) the number of false alarms (FP, false positives): the lower, the better.
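For reference, MOTA combines the FN, FP and IDS counts above into a single score (this is the standard MOTChallenge formula, not something specific to this patent):

```python
def mota(fn, fp, ids, num_gt):
    """Multi-Object Tracking Accuracy: 1 minus the normalized sum of
    false negatives, false positives and identity switches, where
    num_gt is the total number of ground-truth boxes over all frames."""
    return 1.0 - (fn + fp + ids) / num_gt

# e.g. 100 ground-truth boxes, 10 misses, 5 false alarms, 2 ID switches
score = mota(10, 5, 2, 100)
```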
1. Details of the experiment.
In the experiments, all picture blocks are scaled to a size of 64 × 128. The convolutional neural network used to extract appearance features is implemented based on ResNet-34 (Deep residual learning for image recognition, CVPR 2016): the last FC layer is removed to obtain 2 × 4 × 256 features, the features are flattened into a 2048-dimensional vector, and the vector is finally fed into an FC layer to obtain a 256-dimensional appearance feature vector. The relative position feature output by the relative position encoder (RPE) is also 256-dimensional.
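A minimal shape-level sketch of the described feature pipeline, with random weights standing in for the trained ResNet-34 backbone and FC layer (all weights and the variable names are placeholders, not the patent's actual parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_appearance(feat_2x4x256, w_fc):
    """Flatten the 2 x 4 x 256 backbone output into a 2048-d vector,
    then project it to a 256-d appearance feature with an FC layer."""
    v = feat_2x4x256.reshape(-1)        # (2048,)
    return w_fc @ v                      # (256,)

backbone_out = rng.normal(size=(2, 4, 256))  # stand-in for ResNet-34 features
w_fc = rng.normal(size=(256, 2048)) * 0.01   # stand-in FC weights
feat = extract_appearance(backbone_out, w_fc)
```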
The appearance convolutional neural network, the relative position encoder and the classifier are trained end to end for 30 epochs on the training set using a binary cross-entropy loss function. The inputs to the classifier during training are divided into positive and negative samples: when the anchors o_{t-1}^i and o_t^j of the two graphs are the same target and their matched neighbors are also the same targets, the input is a positive sample; all inputs not meeting this requirement are negative samples. The learning rate is initialized to 0.002 and halved every 10 epochs. In addition, online hard example mining is used to alleviate the imbalance between positive and negative samples.
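The stated schedule (initial learning rate 0.002, halved every 10 epochs over 30 epochs) can be written compactly as:

```python
def learning_rate(epoch, base=0.002, step=10):
    """Halve the base learning rate once every `step` epochs."""
    return base * 0.5 ** (epoch // step)

# epochs 0-9 -> 0.002, epochs 10-19 -> 0.001, epochs 20-29 -> 0.0005
rates = [learning_rate(e) for e in (0, 9, 10, 20, 29)]
```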
2. Ablation experiment
To verify the validity of the GSM (graph similarity model) of the invention, a baseline model, denoted Naive, is also designed. Naive has no RPE (relative position encoder), and its classifier classifies only according to the appearance features of targets (i.e., it determines whether two appearance features belong to the same target). Two trackers are further designed whose only difference is the similarity model used; both track by matching and associating targets in adjacent frames. The 7 videos in the MOT16 training set are further divided into a training subset and a validation subset: the validation subset comprises MOT16-09 and MOT16-10, and the remaining 5 videos constitute the training subset. The tracking results of the different models on the validation subset are shown in Table 1.
Table 1: tracking performance on verification subsets
In Table 1, the superscript of each model denotes the number K of neighbors used. The first two rows compare multi-target tracking performance without neighbors; the two models perform essentially the same. This is understandable: from the calculation formula of the relative position vector, the relative position of a target to itself is an 8-dimensional all-zero vector, which means that GSM uses only appearance features when the number of neighbors K is 0.
The middle four rows compare the effect of hard matching (subscript h) and soft matching (subscript s) on tracking performance; five neighbors are used. For the Naive model, the constructed graph has only nodes and no edges. It can be seen that both hard matching and soft matching have a negative effect on the Naive model, for the following reasons: (1) for two graphs with different anchors, if the same targets appear in both neighborhoods, either matching scheme increases the similarity of the two graphs; (2) for two graphs with different anchors, the negative effect of soft matching is not negligible because only appearance features, and no relative position features, are used. For the GSM model, when 5 neighbors are used, hard matching instead degrades tracking performance (compare GSM_h^5 with GSM^0 and GSM_s^5): when the anchors of the two graphs are the same target but their neighbors are not aligned, the similarity produced by hard matching is low. Comparing GSM_s^5 with GSM^0, the IDF1 of GSM_s^5 is 5.9% higher and its IDS is lower, indicating that GSM_s^5 assigns the same ID to the same target more frequently during tracking.
The last row shows the influence of retrieving lost targets on tracking performance (the retrieval-enabled variant of GSM_s^5 in Table 1). When retrieving targets, 64 candidate objects are sampled for each lost target. It can be seen that its FN is lower, indicating that some lost targets are retrieved; however, its FP is also higher, indicating that some retrievals fail.
To find a suitable value of K, many experiments were carried out using soft matching, as shown in Fig. 8. Overall, when K ≥ 5, MOTA improves slightly while IDF1 remains basically unchanged. The reason is that as K increases, when the anchors of the two graphs are the same target, the resulting similarity score becomes higher (more reliable); conversely, when the anchors of the two graphs are different targets, the resulting similarity score also becomes higher (less reliable). These positive and negative effects cancel each other. To trade off tracking performance against running time, K is set to 5. On the validation set, creating a directed graph and computing the similarity score between two graphs take 0.15 ms and 0.03 ms, respectively. After replacing the Naive model with the GSM model, the tracking speed drops from 93.7 FPS to 61.5 FPS.
Some tracking results are visualized in Fig. 9 (first row: results of Naive^0; second row: results of GSM^0; third row: results of GSM_s^5). In frame 27 there are two targets inside the dashed box, denoted target 2 on the right and target 8 on the left. In frame 29, target 2 is partially occluded by target 8, so target 2 is not detected; however, its position is still well estimated by the lost-target retrieval. In frame 49, target 2 is completely occluded by target 8: Naive^0 erroneously identifies target 8 as target 2, while GSM^0 and GSM_s^5 identify the target correctly (the rectangle on the right in the middle).
3. Results on MOTChallenge
The GSM model of the invention is applied to Tracktor (Tracking without bells and whistles, ICCV 2019), the best-performing published tracker at the time, and the resulting tracker is denoted GSM_Tracktor. In addition, GSM_Tracktor is compared with the algorithms MTDF (Multi-level cooperative fusion of GM-PHD filters for online multiple human tracking, IEEE Transactions on Multimedia 2019), STAM (Online multi-object tracking using CNN-based single object tracker with spatial-temporal attention mechanism, ICCV 2017), DMMOT, AMIR (Tracking the untrackable: learning to track multiple cues with long-term dependencies, ICCV 2017), STRN (Spatial-temporal relation networks for multi-object tracking, ICCV 2019), DMAN (Online multi-object tracking with dual matching attention networks, ECCV 2018), HAM-SADF (Online multi-object tracking with historical appearance matching and scene adaptive detection filtering, AVSS 2018), MOTDT (Real-time multiple people tracking with deeply learned candidate selection and person re-identification, ICME 2018) and FAMNet. The methods are tested on the MOT16 and MOT17 test sets; the test results are shown in Table 2.
Table 2: tracking performance of different tracking algorithms on MOTChalnge
On MOT16, GSM_Tracktor achieves the best tracking performance on all metrics except MOTP, FP and IDS, and ranks second on IDS (475), only slightly higher than the first place (473). Compared with Tracktor, GSM_Tracktor improves MOTA and IDF1 by 3.6% and 5.7% respectively, and also reduces IDS by 30.4%. On MOT17, GSM_Tracktor likewise achieves the best overall performance, obtaining better results than Tracktor on almost all metrics; in particular, the improvements on MOTA and IDF1 are 2.9% and 5.5% respectively, and IDS is reduced by more than 20%. The good results of GSM_Tracktor on IDS and IDF1 show that the GSM model has strong feature representation ability and that the computed similarity is more robust.
Through the above description of the embodiments, it will be clear to those skilled in the art that the above embodiments may be implemented by software, or by software plus a necessary general hardware platform. Based on this understanding, the technical solutions of the embodiments may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for enabling a computer device (such as a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (9)
1. A similarity calculation method for multi-target tracking, comprising:
for each target in each video frame, calculating the neighbors of the target, constructing a vertex set from the appearance features of the target and its neighbors, and calculating a directed edge set from the interrelations between targets, thereby constructing a directed graph; and
for adjacent video frames, performing matching calculation using the directed graphs of the targets in the two video frames to obtain a similarity calculation result.
2. The similarity calculation method for multi-target tracking according to claim 1, wherein the target set in the t-th frame is expressed as O_t = {o_t^i = (b_t^i, p_t^i)}_{i=1}^{I_t}, wherein the i-th target is represented as o_t^i; b_t^i represents the position box of the i-th target, whose four elements are the coordinates of its top-left corner and the width and height of the position box; p_t^i represents the picture block cropped from the t-th frame according to the position box; and I_t indicates the number of targets in the t-th frame.
3. The similarity calculation method for multi-target tracking according to claim 2, wherein K neighbors of each target are obtained according to the distances between targets, K being the total number of neighbors; for the t-th frame, with the i-th target o_t^i as the anchor, the target o_t^i serves as its own 0-th neighbor, and the target o_t^i and its neighbors form the set N_t^i = {o_t^{i,k}}_{k=0}^{K}, wherein o_t^{i,k} is the k-th neighbor of the target o_t^i.
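An illustrative sketch of this neighbor-set construction, taking the K targets whose box centers are nearest to the anchor (Euclidean center distance is an assumption here; the claim only says "distance between targets"):

```python
import numpy as np

def neighbor_set(boxes, i, K):
    """Return indices [i, n1, ..., nK]: the anchor as its own 0-th
    neighbor, followed by its K nearest targets by center distance.
    Boxes are (x, y, w, h)."""
    boxes = np.asarray(boxes, dtype=float)
    centers = boxes[:, :2] + boxes[:, 2:] / 2.0
    d = np.linalg.norm(centers - centers[i], axis=1)
    d[i] = np.inf                      # exclude the anchor itself
    nearest = np.argsort(d)[:K]
    return [i] + nearest.tolist()

boxes = [(0, 0, 2, 2), (10, 0, 2, 2), (1, 1, 2, 2), (50, 50, 2, 2)]
# anchor 0's two nearest neighbors are targets 2 and 1, in that order
```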
4. The similarity calculation method for multi-target tracking according to claim 3, wherein for the i-th target o_t^i in the t-th frame, the constructed directed graph is represented as G_t^i = (V_t^i, E_t^i), wherein the vertex set V_t^i is defined as:
V_t^i = { f_t^{i,k} = φ_ACNN(p_t^{i,k}) }_{k=0}^{K}
wherein f_t^{i,k} represents the appearance feature of o_t^{i,k}, and φ_ACNN(·) represents the forward function of the convolutional neural network used to extract appearance features.
For the directed edge set E_t^i, the relative position vector e_t^{i,k} between the anchor o_t^i and its k-th neighbor o_t^{i,k} is first computed as:
e_t^{i,k} = φ_RP(b_t^i, b_t^{i,k}; w_t, h_t)
wherein w_t and h_t are the width and height of the t-th frame, and φ_RP(·) is a function that computes the relative position between targets based on their position boxes.
A relative position encoder is used to transform the relative position vector e_t^{i,k} into r_t^{i,k}, thereby obtaining the directed edge set:
E_t^i = { r_t^{i,k} = φ_RPE(e_t^{i,k}) }_{k=0}^{K}
wherein φ_RPE(·) is the relative position encoder.
5. The similarity calculation method for multi-target tracking according to claim 4, wherein performing hard matching using the directed graphs of targets in two video frames comprises:
given the directed graphs of two objects for adjacent video frames, frame t-1 and frame tAndfirst, a similarity matrix is calculatedThe elements in the k-th row and k' -th column of the matrix are calculated as follows:
in the formula (I), the compound is shown in the specification,representing element subtraction between feature vectors, | · non-2Representing the squaring of elements in a pair vector, [, ]]Means that two vectors are spliced together, phiBC(. cndot.) represents the forward function of the two classifiers.
Finally, the hard-matching similarity is obtained from the aligned (diagonal) node pairs:
s_h(G_{t-1}^i, G_t^j) = (1/(K+1)) Σ_{k=0}^{K} S^{i,j}_{k,k}
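An illustrative sketch of the pairwise node similarity and the hard-matching score, under the reading that hard matching averages the diagonal entries of the similarity matrix (aligned node pairs). A logistic-style stub stands in for the trained binary classifier φ_BC:

```python
import numpy as np

def node_pair_feature(fa, fb, ra, rb):
    """Concatenate squared element-wise differences of appearance and
    relative-position features for one node pair."""
    return np.concatenate([(fa - fb) ** 2, (ra - rb) ** 2])

def similarity_matrix(F1, R1, F2, R2, classifier):
    """S[k, k'] = classifier score for node k of graph 1 vs node k' of
    graph 2; F*, R* have shape (K+1, d)."""
    n, m = len(F1), len(F2)
    S = np.empty((n, m))
    for k in range(n):
        for kp in range(m):
            S[k, kp] = classifier(node_pair_feature(F1[k], F2[kp], R1[k], R2[kp]))
    return S

def hard_match(S):
    """Hard matching: average the diagonal (k-th node to k-th node)."""
    return float(np.mean(np.diag(S)))

# stub classifier: score near 1 when the pair feature is small
clf = lambda x: float(np.exp(-x.sum()))

# identical graphs give the maximum hard-match score of 1.0
rng = np.random.default_rng(1)
F = rng.normal(size=(3, 4))   # K+1 = 3 nodes, 4-d appearance features
R = rng.normal(size=(3, 2))   # 2-d relative-position features
S = similarity_matrix(F, R, F, R, clf)
```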
6. The similarity calculation method for multi-target tracking according to claim 5, wherein performing soft matching using the directed graphs of the targets in the two video frames comprises:
on the basis of hard matching, firstly performing proximity alignment, and then calculating the similarity, which is expressed as:
in the formula (I), the compound is shown in the specification,is a similarity matrix Si,jRemoving the first row and the first column to obtain a matrix; phi is aLA(. cndot.) is a linear distribution function for completing task distribution and returning the maximum total similarity sum according to the input similarity matrix.
7. A multi-target tracking method is characterized in that the method of any one of claims 1 to 6 is applied to an existing multi-target tracking method based on data association to replace a similarity model in the existing multi-target tracking method.
8. The multi-target tracking method according to claim 7, further comprising: retrieving a target lost in the current frame by using information of the previous frame, with the following steps:
for the ith target in the t-1 frameIf lost at the t frame, then use the ith targetIts k-th neighborRelative position of each otherAndestimate the location frame of the ith target in the t-th frame
In the formula (I), the compound is shown in the specification,is indicative of phiRPInverse function of phiRP(-) is a function that calculates the relative position between the targets based on the location box;to representThe corresponding target in the t-th frame.
9. The multi-target tracking method according to claim 8, wherein after the final position box b̂_t^i of the i-th target in the t-th frame is obtained, several candidate boxes are sampled from a Gaussian distribution based on b̂_t^i;
for any sampled candidate box b_t^{i,c}, o_t^{i,c} = (b_t^{i,c}, p_t^{i,c}) represents a candidate target in the t-th frame, wherein p_t^{i,c} represents the picture block cropped from the t-th frame according to the position box b_t^{i,c}; a directed graph G_t^{i,c} is constructed, and the similarity between G_t^{i,c} and G_{t-1}^i is then obtained; and
the candidate target with the highest similarity score is selected from all candidate targets, and if the similarity score is greater than a set threshold, the candidate target with the highest score is taken as the tracking result of o_{t-1}^i in the t-th frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110695292.5A CN113409356A (en) | 2021-06-23 | 2021-06-23 | Similarity calculation method and multi-target tracking method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113409356A true CN113409356A (en) | 2021-09-17 |
Family
ID=77682492
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111882580A (en) * | 2020-07-17 | 2020-11-03 | 元神科技(杭州)有限公司 | Video multi-target tracking method and system |
EP3770854A1 (en) * | 2018-09-14 | 2021-01-27 | Tencent Technology (Shenzhen) Company Limited | Target tracking method, apparatus, medium, and device |
Non-Patent Citations (1)
Title |
---|
QIANKUN LIU ET AL.: "GSM: Graph Similarity Model for Multi-Object Tracking", Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20210917 |