CN113409356A - Similarity calculation method and multi-target tracking method - Google Patents
- Publication number
- Publication number: CN113409356A (application number CN202110695292.5A)
- Authority
- CN
- China
- Prior art keywords
- target
- frame
- similarity
- targets
- tracking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/248—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/045—Combinations of networks
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- G06T2207/10016—Video; image sequence
- G06T2207/20021—Dividing image into blocks, subimages or windows
- G06T2207/20081—Training; learning
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention discloses a similarity calculation method and a multi-target tracking method. For each target in each video frame, the neighbors of the target are computed; a vertex set is constructed using the appearance features of the target and its neighbors, and a directed edge set is computed using the interrelationship between targets, thereby constructing a directed graph. For adjacent video frames, matching is performed between the directed graphs of the targets in the two frames to obtain a similarity calculation result. The invention simultaneously utilizes the appearance features of targets and the relative position features between targets, so that the topological structure between targets is encoded into the directed graph. The invention improves the accuracy and precision of multi-target tracking; it can retrieve a lost target when that target is severely occluded by other targets; and it tracks more targets while reducing the number of lost targets.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a similarity calculation method and a multi-target tracking method.
Background
Multi-target tracking means that, given a video, the algorithm outputs position boxes for the targets of interest in the video. It should be noted that the number of targets in the video is not fixed. During tracking, each target is also assigned unique identity information (a number, denoted by ID). Multi-target tracking has wide applications, such as intelligent surveillance and autonomous driving. With the rapid development of object detection, detection-based multi-target tracking methods have gradually become mainstream. In such methods, the multi-target tracking problem is mainly solved by data association.
In general, a robust similarity model is the key to solving the data association. In most existing methods, the similarity between targets is calculated based on the features of the individual targets; that is, the similarity model does not consider the interrelationship between targets. As the feature representation capability of targets has grown stronger with the development of deep learning, similarity calculated from individual target features has become reasonably robust. However, this calculation method still has limitations in complex scenarios. For example, when the tracked targets belong to the same category (e.g., pedestrian tracking, vehicle tracking), the targets have a certain similarity in appearance, and frequent occlusion between targets is inevitable. Fig. 1 demonstrates such a complex scenario in pedestrian tracking. Part (a) of fig. 1 shows two adjacent frames; the clothes of different pedestrians are relatively similar in this scene, and there is some occlusion between pedestrians. Part (b) of fig. 1 shows the similarity score calculated from individual target features; the basic flow is as follows: first, individual features of a single target are extracted (Individual Representation), and then similarity calculation is performed (Similarity Measure). It can be seen that the calculated similarity score is relatively low when a pedestrian is occluded, and relatively high when different pedestrians are dressed similarly. Therefore, in such a complex scenario, the similarity score calculated from individual target features is not reliable enough.
Appearance features of targets are widely used in target tracking. Earlier work utilized manually designed appearance features for multi-target tracking. For example, Yamaguchi et al. used raw pixel templates in the article "Who are you with and where are you going? CVPR 2011", and Izadinia et al. used color histograms and Histograms of Oriented Gradients (HOG) in the article "(MP)2T: Multiple people multiple parts tracker. ECCV 2012". With the development of deep learning in recent years, appearance features extracted by a Convolutional Neural Network (CNN) have been widely applied in the field of multi-target tracking.
Motion information of targets is also widely applied in the field of multi-target tracking. Methods using motion information are basically based on the assumption that the motion of targets is smooth and slow. Milan et al. designed a linear motion model in the article "Continuous energy minimization for multi-target tracking. TPAMI 2013", and Yang et al. designed a non-linear motion model in the article "Multi-target tracking by online learning of non-linear motion patterns and robust appearance models. CVPR 2012". However, the motion of a target in a video is determined not only by the target itself but also by the shooting device. During shooting, the device inevitably has a certain jitter, which is generally random and unpredictable. Therefore, it is difficult to solve the camera shake problem by relying on a motion model alone.
The methods above design and utilize features of individual targets and do not exploit the interrelationship between targets. Some recent work has attempted multi-target tracking using the interrelationship between targets. Sadeghian et al. designed an Occupancy Map by partitioning a picture into a grid of fixed size in the article "Tracking the untrackable: Learning to track multiple cues with long-term dependencies. ICCV 2017": when a target exists in a certain grid cell, the corresponding value in the occupancy map is set to 1, otherwise to 0. This processing only roughly records the distribution of targets; a grid cell with value 1 cannot distinguish different targets or the number of targets. Xu et al., in the article "Spatial-temporal relation networks for multi-object tracking. ICCV 2019", extract the interrelationship between targets using a Relation Network. Specifically, the mutual position relationship between targets is encoded into a weight, and this weight is then used to fuse the appearance features of the other targets in the current frame; the weights each target uses when fusing the appearance features of the other targets are different. However, this has limited interpretability and ignores the topology between targets.
Disclosure of Invention
The invention aims to provide a similarity calculation method and a multi-target tracking method, which can improve the robustness of a similarity model in multi-target tracking and ensure the tracking effect of the multi-target tracking in a complex scene.
The purpose of the invention is realized by the following technical scheme: a similarity calculation method for multi-target tracking, comprising:
for each target in each video frame, calculating the neighbor of the target, constructing a vertex set by using the appearance characteristics of the target and the neighbor of the target, and calculating a directed edge set by using the correlation between the targets, thereby constructing a directed graph.
And for adjacent video frames, performing matching calculation by using the directed graphs of all targets in the two video frames to obtain a similarity calculation result.
Further, the target set in the t-th frame is expressed as O^t = {o_i^t = (b_i^t, p_i^t)}_{i=1}^{I_t}, where the i-th target is represented as o_i^t; b_i^t represents the position box of the i-th target, whose four elements are the horizontal and vertical coordinates of its upper-left corner and the width and height of the position box; p_i^t represents the picture block truncated from the t-th frame according to the position box; and I_t indicates the number of targets in the t-th frame.
Further, the K neighbors of each target are obtained according to the distances between targets, where K is the total number of neighbors. For the t-th frame, with the i-th target o_i^t as the anchor, the target o_i^t is regarded as its own 0-th neighbor, and the target together with its neighbors forms the set N_i^t = {o_{i_k}^t}_{k=0}^{K}, where o_{i_k}^t is the k-th neighbor of o_i^t.
Further, for the i-th target o_i^t in the t-th frame, the constructed directed graph is represented as G_i^t = (V_i^t, E_i^t), where the vertex set V_i^t is defined as:

V_i^t = { v_{i_k}^t = φ_ACNN(p_{i_k}^t) | k = 0, 1, ..., K }

where v_{i_k}^t represents the appearance feature of o_{i_k}^t and φ_ACNN(·) represents the forward function of the convolutional neural network used to extract appearance features.
For the directed edge set E_i^t, first let r_{i,k}^t represent the relative position vector between the anchor o_i^t and its k-th neighbor o_{i_k}^t:

r_{i,k}^t = φ_RP(b_i^t, b_{i_k}^t)

where the box coordinates are normalized by w_t and h_t, the width and height of the t-th frame, and φ_RP(·) is a function that calculates the relative position between targets based on their position boxes.
A relative position encoder is used to transform the relative position vector r_{i,k}^t into e_{i,k}^t, thereby obtaining the directed edge set:

E_i^t = { e_{i,k}^t = φ_RPE(r_{i,k}^t) | k = 0, 1, ..., K }

where φ_RPE(·) is the relative position encoder.
Further, hard matching using the directed graphs of targets in two video frames comprises the following steps:

For adjacent video frames, the (t-1)-th frame and the t-th frame, given the directed graphs G_i^{t-1} and G_j^t of two targets, first calculate a similarity matrix S^{i,j} of size (K+1) × (K+1); the element in the k-th row and k'-th column of the matrix is calculated as:

s_{k,k'} = φ_BC([ (v_{i_k}^{t-1} ⊖ v_{j_{k'}}^t)^2, (e_{i,k}^{t-1} ⊖ e_{j,k'}^t)^2 ])

where ⊖ represents element-wise subtraction between feature vectors, (·)^2 represents squaring each element of a vector, [·,·] means that two vectors are spliced together, and φ_BC(·) represents the forward function of the binary classifier.

Finally, the hard-matching similarity is obtained as the average of the diagonal elements:

s_hard^{i,j} = (1/(K+1)) Σ_{k=0}^{K} s_{k,k}
further, soft matching is carried out by utilizing directed graphs of objects in two video frames, and the soft matching method comprises the following steps:
on the basis of hard matching, firstly performing proximity alignment, and then calculating the similarity, which is expressed as:
in the formula (I), the compound is shown in the specification,is a similarity matrix Si,jRemoving the first row and the first column to obtain a matrix; phi is aLA(. cndot.) is a linear distribution function for completing task distribution and returning the maximum total similarity sum according to the input similarity matrix.
The multi-target tracking method applies the similarity calculation method to the existing multi-target tracking method based on data association to replace a similarity model in the multi-target tracking method.
Further, comprising: retrieving a target lost in the current frame by using information from the previous frame, as follows:

For the i-th target o_i^{t-1} in the (t-1)-th frame, if it is lost in the t-th frame, the relative position r_{i,k}^{t-1} between o_i^{t-1} and its k-th neighbor o_{i_k}^{t-1} is used, together with the position box of that neighbor's corresponding target in the t-th frame, to estimate a position box for the i-th target in the t-th frame:

b̂_{i,k}^t = φ_RP^{-1}(r_{i,k}^{t-1}, b_{i_k}^t)

where φ_RP^{-1}(·) represents the inverse function of φ_RP, φ_RP(·) is the function that calculates the relative position between targets based on their position boxes, and o_{i_k}^t represents the target in the t-th frame corresponding to o_{i_k}^{t-1}.

All K neighbors of the i-th target estimate a position box for it in the t-th frame in this manner, and the final position box of the i-th target in the t-th frame is calculated by averaging:

b̂_i^t = (1/K) Σ_{k=1}^{K} b̂_{i,k}^t
Further, after the final position box b̂_i^t of the i-th target in the t-th frame is obtained, several candidate boxes are sampled from a Gaussian distribution centered on b̂_i^t.

For any sampled candidate box b_c, o_c = (b_c, p_c) represents a candidate of o_i^{t-1} in the t-th frame, where p_c denotes the picture block truncated from the t-th frame according to the position box b_c; the directed graph G_c is then constructed, and the similarity between G_i^{t-1} and G_c is obtained.

Among all candidates, the one with the highest similarity score is selected; if that score is greater than a set threshold, the highest-scoring candidate is taken as the tracking result of o_i^{t-1} in the t-th frame.
The invention has the following beneficial effects: it simultaneously utilizes the appearance features of targets and the relative position features between targets, so that the topological structure between targets is encoded into the directed graph. The invention improves the accuracy and precision of multi-target tracking; it can retrieve a lost target when that target is severely occluded by other targets; and it tracks more targets while reducing the number of lost targets.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 is a diagram illustrating the effect of the prior art in a complex application scenario;
fig. 2 is a schematic diagram illustrating an effect of the similarity calculation method according to the embodiment of the present invention;
FIG. 3 is a schematic diagram of a constructed directed graph provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a relative position encoder according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a second classifier according to an embodiment of the present invention;
fig. 6 is a schematic diagram illustrating a change of a directed graph in a missing detection situation according to an embodiment of the present invention;
fig. 7 is a schematic diagram of retrieving a lost target based on a directed graph according to an embodiment of the present invention;
FIG. 8 illustrates tracking performance at different K values provided by embodiments of the present invention;
fig. 9 is a schematic diagram of a visual tracking result provided in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The similarity calculation method for multi-target tracking provided by the invention can serve as a similarity model that improves robustness in multi-target tracking. Specifically, the similarity model is a graph similarity model; it is general-purpose and can replace the similarity model in existing multi-target tracking methods.
As shown in fig. 2, which illustrates the main principle of the similarity calculation method for multi-target tracking (taking the scenario shown in part (a) of fig. 1 as an example), the method mainly includes:
1. for each target in each video frame, calculating the neighbor of the target, then using the appearance characteristics of the target and the neighbor to construct a vertex set, and using the correlation between the targets to calculate a directed edge set, thereby constructing a directed Graph, namely 'Graph Representation' in fig. 2.
2. For adjacent video frames, Matching calculation is performed by using the directed graphs of the targets in the two video frames to obtain a similarity calculation result, namely 'Graph Matching' in fig. 2.
For convenience of understanding, the following detailed description is made on the above scheme of the present invention, and the scheme is mainly described by two parts, wherein the first part mainly describes data association in multi-target tracking; the second section mainly introduces the principle of the graph similarity model.
Firstly, data association.
The target set in the t-th frame is expressed as O^t = {o_i^t = (b_i^t, p_i^t)}_{i=1}^{I_t}, where the i-th target is represented as o_i^t; b_i^t represents the position box of the i-th target, whose four elements are the horizontal and vertical coordinates of its upper-left corner and the width and height of the position box; p_i^t represents the picture block truncated from the t-th frame according to the position box of the i-th target; and I_t indicates the number of targets in the t-th frame.

Data association is performed on two adjacent frames, the (t-1)-th frame and the t-th frame. When solving the data association, a cost matrix of size I_{t-1} × I_t needs to be provided; the element m_{i,j} in the i-th row and j-th column of the cost matrix represents the cost of matching the targets o_i^{t-1} and o_j^t, calculated as follows:

m_{i,j} = φ_CI(o_i^{t-1}, o_j^t)

where φ_CI(·,·) denotes a function that calculates the cost based on individual target features.
Most existing methods mainly learn better individual target features or design a better cost function φ_CI to improve the robustness of the similarity model, but this does not take the interrelationship between targets into account.
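As a concrete illustration of the data-association step, the following sketch (hypothetical, not taken from the patent) builds a cost matrix from a pairwise cost function standing in for φ_CI and solves the assignment by brute force; a real tracker would use the Hungarian algorithm instead:

```python
from itertools import permutations

def build_cost_matrix(prev_targets, curr_targets, cost_fn):
    """Cost matrix M: M[i][j] is the cost of matching target i in
    frame t-1 to target j in frame t (cost_fn stands in for phi_CI)."""
    return [[cost_fn(p, c) for c in curr_targets] for p in prev_targets]

def min_cost_assignment(cost):
    """Brute-force linear assignment (a stand-in for the Hungarian
    algorithm; assumes rows <= columns and small target counts).
    Returns (total_cost, mapping row -> column)."""
    n_rows, n_cols = len(cost), len(cost[0])
    best_total, best_map = float("inf"), None
    for perm in permutations(range(n_cols), n_rows):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_total, best_map = total, dict(enumerate(perm))
    return best_total, best_map

# Toy example: cost = squared distance between 1-D "appearance" features.
prev = [0.1, 0.9]
curr = [0.85, 0.15]
M = build_cost_matrix(prev, curr, lambda a, b: (a - b) ** 2)
total, mapping = min_cost_assignment(M)
print(mapping)  # {0: 1, 1: 0} -- target 0 matches detection 1
```

The brute-force search is factorial in the number of detections; it is only meant to show what the assignment returns.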
Part (a) of fig. 3 shows the target detection results in two adjacent frames; the numbers in the upper-left corner of the targets' position boxes distinguish the different detections. Taking the 1st target (i.e. the leftmost pedestrian) as an example, it can be seen that o_1^t is severely occluded and differs considerably in appearance from o_1^{t-1}.
In order to utilize the interrelationship between targets, the invention designs a graph similarity model; the graph-based cost is calculated as follows:

m_{i,j} = φ_GI(G_i^{t-1}, G_j^t)

where G_i^{t-1} is the directed graph created for target o_i^{t-1} (implementation details are given below) and φ_GI(·,·) represents a function that computes the cost based on the feature representations of the two graphs. The interrelationship between targets is embedded in the feature representation of the graph.
Secondly, a graph similarity model.
This section is presented from three aspects: obtaining target neighbors, constructing the directed graph, and graph matching.
1. Acquiring target neighbors.
To create the directed graph G_i^t, the K neighbors of o_i^t in the t-th frame must first be acquired (K is the total number of neighbors; its value can be set freely). A target and its neighbors must be in the same frame. There are many ways to obtain neighbors, for which conventional techniques may be consulted; the invention is not limited in this respect.

With some metric, the distance between targets can be calculated. "Distance" here is a general expression that includes, but is not limited to, the Euclidean distance between the center points of the target position boxes; in the embodiment of the invention, neighbors are obtained using this Euclidean distance. The ordered set N_i^t = {o_{i_k}^t}_{k=0}^{K} represents the target o_i^t and its neighbors, where o_{i_k}^t is the k-th neighbor of o_i^t. In addition, o_i^t is defined as the anchor. To simplify notation, o_i^t is regarded as its own 0-th neighbor, so o_{i_0}^t = o_i^t. Part (a) of fig. 3 gives an example with K = 2.
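The neighbor-acquisition step above can be sketched as follows (an illustrative implementation with hypothetical box values; any other distance metric could be substituted):

```python
import math

def box_center(box):
    # box = (x, y, w, h): top-left corner plus width and height
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def k_neighbors(boxes, i, k):
    """Indices of the K nearest neighbors of target i (Euclidean
    distance between box centers), with target i itself as the 0-th
    neighbor, mirroring the ordered set N_i in the text."""
    cx, cy = box_center(boxes[i])
    others = [(math.hypot(box_center(b)[0] - cx, box_center(b)[1] - cy), j)
              for j, b in enumerate(boxes) if j != i]
    others.sort()
    return [i] + [j for _, j in others[:k]]

boxes = [(0, 0, 10, 20), (12, 0, 10, 20), (50, 0, 10, 20), (13, 2, 10, 20)]
print(k_neighbors(boxes, 0, 2))  # [0, 1, 3]
```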
2. Constructing the directed graph.
For the i-th target o_i^t in the t-th frame, the constructed directed graph is represented as G_i^t = (V_i^t, E_i^t). The directed graph G_i^t is built from N_i^t and consists of K+1 vertices and K+1 directed edges. Parts (b), (c) and (d) of fig. 3 respectively show the directed graphs of different targets in two adjacent frames; although the nodes of the directed graphs may be the same, their edges are different. The vertex set is defined as:

V_i^t = { v_{i_k}^t = φ_ACNN(p_{i_k}^t) | k = 0, 1, ..., K }

where v_{i_k}^t represents the appearance feature of o_{i_k}^t, p_{i_k}^t represents the picture block truncated from the t-th frame according to the position box of o_{i_k}^t, and φ_ACNN(·) represents the forward function of the convolutional neural network used to extract appearance features; the convolutional neural network may be implemented in a conventional manner.
In order to exploit the interrelationship between targets, the mutual position between targets is first defined. Let r_{i,k}^t represent the relative position vector between the anchor o_i^t and its k-th neighbor o_{i_k}^t:

r_{i,k}^t = φ_RP(b_i^t, b_{i_k}^t)

where the elements (x, y, w, h) of each position box b are the horizontal and vertical coordinates of its upper-left corner and its width and height, w_t and h_t are the width and height of the t-th frame (used to normalize the coordinates), and φ_RP(·) is a function that calculates the relative position between targets based on their position boxes.
using relative position encoders to align relative position vectorsIs transformed to obtainThereby obtaining a set of directed edges
In the above formula, the first and second carbon atoms are,to representElement of (5), phiRPE(. cndot.) is a relative position encoder.
Illustratively, the relative position vector r_{i,k}^t can be an 8-dimensional vector, and this 8-dimensional relative position vector can be transformed into a high-dimensional relative position vector using the method designed by Vaswani et al. in the article "Attention is all you need. NIPS 2017". Fig. 4 schematically shows the structure of the relative position encoder, which mainly consists of an FC (Fully Connected) layer and also utilizes a Batch Normalization (BN) layer and a ReLU activation function.
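A minimal sketch of such an encoder, assuming for illustration that the BN layer is folded into the FC weights at inference time and using random, hypothetical weights and dimensions:

```python
import random

def fc_relu(x, weights, bias):
    """One fully connected layer followed by ReLU. At inference a
    BatchNorm layer can be folded into (weights, bias), so this single
    affine map plus ReLU sketches the FC-BN-ReLU encoder."""
    out = []
    for w_row, b in zip(weights, bias):
        v = sum(wi * xi for wi, xi in zip(w_row, x)) + b
        out.append(max(0.0, v))  # ReLU
    return out

random.seed(0)
in_dim, out_dim = 8, 16          # 8-D relative position -> 16-D (illustrative sizes)
W = [[random.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
b = [0.0] * out_dim

rel_pos = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 0.02, 0.07]  # hypothetical r vector
encoded = fc_relu(rel_pos, W, b)
print(len(encoded))  # 16
```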
3. Graph matching.
For adjacent video frames, i.e. the (t-1)-th frame and the t-th frame, given the directed graphs G_i^{t-1} and G_j^t of two targets, a similarity matrix S^{i,j} of size (K+1) × (K+1) is first calculated; the element s_{k,k'} in the k-th row and k'-th column of the matrix is calculated as:

s_{k,k'} = φ_BC([ (v_{i_k}^{t-1} ⊖ v_{j_{k'}}^t)^2, (e_{i,k}^{t-1} ⊖ e_{j,k'}^t)^2 ])

where ⊖ represents element-wise subtraction between feature vectors, (·)^2 squares each element of a vector, [·,·] represents concatenating two vectors, and φ_BC(·) denotes the forward function of the Binary Classifier (BC); fig. 5 schematically shows the structure of the binary classifier. The similarity is then taken as the average of the diagonal elements:

s_hard^{i,j} = (1/(K+1)) Σ_{k=0}^{K} s_{k,k}

This matching mode is hard matching. In practical applications the detection results are not perfect: missed detections and false alarms exist, so the similarity score obtained by this calculation is not highly robust.
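Hard matching can be sketched as follows (hypothetical similarity values; the high off-diagonal scores mimic a swapped neighbor order):

```python
def hard_match_score(S):
    """Hard matching: average the diagonal of the (K+1)x(K+1)
    similarity matrix, i.e. compare the k-th neighbor of one graph
    only against the k-th neighbor of the other."""
    n = len(S)
    return sum(S[k][k] for k in range(n)) / n

# Hypothetical 3x3 similarity matrix (anchor + K=2 neighbors).
S = [[0.9, 0.2, 0.1],
     [0.3, 0.1, 0.8],   # neighbor order swapped: high scores sit off-diagonal
     [0.2, 0.7, 0.2]]
print(round(hard_match_score(S), 2))  # 0.4 -- low, because neighbors are misaligned
```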
As shown in fig. 6, when a target is missed in the t-th frame, the correspondingly constructed directed graphs in the two frames also change. The similarity score obtained by directly performing a hard match is then not reliable, because the order of the neighbors changes, i.e. the neighbors are no longer aligned.
In order to solve the problem of unaligned neighbors, a soft matching scheme is further proposed: on the basis of hard matching, neighbor alignment is performed first, and the similarity is then calculated, expressed as:

s_soft^{i,j} = ( s_{0,0} + φ_LA(S'^{i,j}) ) / (K+1)

where S'^{i,j} is the matrix obtained by removing the first row and first column of the similarity matrix S^{i,j}, and φ_LA(·) is a linear assignment function, obtained by modifying the Hungarian algorithm, that completes the task assignment and returns the maximum total similarity according to the input similarity matrix.

Compared with hard matching, soft matching can align the K neighbors of the anchor. It should be noted, however, that the similarity score obtained by soft matching is never less than that obtained by hard matching, i.e. s_soft^{i,j} ≥ s_hard^{i,j} always holds. Soft matching therefore has a positive effect when the two anchors of the two graphs are the same target (positive samples) and a negative effect when the anchors are different targets (negative samples). Nevertheless, since both the appearance features of targets and the relative position features between targets are encoded into the directed graph, the feature representation capability of the directed graph is strong, and the negative influence of soft matching on negative samples is basically negligible. Finally, the aforementioned cost m_{i,j} = φ_GI(G_i^{t-1}, G_j^t) can be rewritten directly in terms of the soft-matching similarity s_soft^{i,j}.
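A sketch of soft matching under the same kind of hypothetical similarity matrix, with the linear assignment φ_LA implemented by brute force rather than the modified Hungarian algorithm the text describes:

```python
from itertools import permutations

def max_sum_assignment(S):
    """Stand-in for phi_LA: linear assignment maximizing the total
    similarity (brute force; fine for small K)."""
    n = len(S)
    return max(sum(S[k][p[k]] for k in range(n))
               for p in permutations(range(n)))

def soft_match_score(S):
    """Soft matching: keep the anchor-anchor score S[0][0], align the
    K neighbors by linear assignment on the submatrix with the first
    row and column removed, then average over the K+1 entries."""
    sub = [row[1:] for row in S[1:]]
    return (S[0][0] + max_sum_assignment(sub)) / len(S)

S = [[0.9, 0.2, 0.1],
     [0.3, 0.1, 0.8],
     [0.2, 0.7, 0.2]]
hard = sum(S[k][k] for k in range(3)) / 3
print(round(hard, 2), round(soft_match_score(S), 2))  # 0.4 0.8
```

As the example shows, realigning the swapped neighbors recovers a high score, and soft ≥ hard always holds since the identity assignment is one of the candidates.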
preferably, the embodiment of the invention also provides a multi-target tracking method capable of retrieving the lost target.
In the field of multi-target tracking, one target may be occluded by other targets. When the occlusion is severe, it is difficult for the detector to detect the occluded object, and as shown in fig. 6, the rightmost object is lost at the t-th frame. Because the graph similarity model designed by the invention utilizes the topological structure between the targets, the lost targets can be found back by utilizing the graph similarity model in the invention, as shown in fig. 7.
For the i-th target o_i^{t-1} in the (t-1)-th frame, if it is lost in the t-th frame, the relative position r_{i,k}^{t-1} between o_i^{t-1} and its k-th neighbor o_{i_k}^{t-1} is used, together with the position box of that neighbor's corresponding target in the t-th frame, to estimate a position box for the i-th target in the t-th frame:

b̂_{i,k}^t = φ_RP^{-1}(r_{i,k}^{t-1}, b_{i_k}^t)

where φ_RP^{-1}(·) represents the inverse function of φ_RP, φ_RP(·) is the function that calculates the relative position between targets based on their position boxes, and o_{i_k}^t represents the target in the t-th frame corresponding to o_{i_k}^{t-1}.
All K neighbors of the i-th target estimate a position box for it in the t-th frame in this manner, and the final position box b̂_i^t of the i-th target in the t-th frame is calculated by averaging:

b̂_i^t = (1/K) Σ_{k=1}^{K} b̂_{i,k}^t
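A sketch of this retrieval step, assuming for illustration a simple invertible φ_RP (center offset plus the anchor's size); the patent's actual relative-position function is not specified here:

```python
def rel_pos(anchor_box, nbr_box):
    """Hypothetical invertible phi_RP: offset from neighbor to anchor
    (top-left corners) plus the anchor's own width and height."""
    ax, ay, aw, ah = anchor_box
    nx, ny, nw, nh = nbr_box
    return (ax - nx, ay - ny, aw, ah)

def rel_pos_inv(r, nbr_box_t):
    """phi_RP inverse: reconstruct the anchor box in frame t from the
    stored relative position and the neighbor's box in frame t."""
    dx, dy, w, h = r
    nx, ny, _, _ = nbr_box_t
    return (nx + dx, ny + dy, w, h)

def estimate_lost_box(rels, nbr_boxes_t):
    """Average the per-neighbor estimates to get the final box."""
    ests = [rel_pos_inv(r, nb) for r, nb in zip(rels, nbr_boxes_t)]
    n = len(ests)
    return tuple(sum(e[d] for e in ests) / n for d in range(4))

# Frame t-1: lost target and two neighbors; frame t: neighbors moved right by 5.
lost_tm1 = (10, 10, 4, 8)
nbrs_tm1 = [(20, 10, 4, 8), (30, 12, 4, 8)]
nbrs_t   = [(25, 10, 4, 8), (35, 12, 4, 8)]
rels = [rel_pos(lost_tm1, nb) for nb in nbrs_tm1]
print(estimate_lost_box(rels, nbrs_t))  # (15.0, 10.0, 4.0, 8.0)
```

Because both neighbors translated rigidly, every per-neighbor estimate agrees and the average lands exactly where the lost target should be.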
Further, after the final position box b̂_i^t of the i-th target in the t-th frame is obtained, several candidate boxes are sampled from a Gaussian distribution centered on b̂_i^t. For any sampled candidate box b_c, o_c = (b_c, p_c) represents a candidate of o_i^{t-1} in the t-th frame, where p_c denotes the picture block truncated from the t-th frame according to the position box b_c; the directed graph G_c is constructed, and the similarity between G_i^{t-1} and G_c is then obtained. Among all candidates, the one with the highest similarity score is selected; if that score is greater than a set threshold, the highest-scoring candidate is taken as the tracking result of o_i^{t-1} in the t-th frame.
In the above-mentioned solution of the embodiment of the present invention, a feature representation (i.e. directed graph) of a graph is designed, and the feature representation not only utilizes the features of target individuals, but also utilizes the interrelation between targets. This correlation is represented by a directed graph, which is also in fact a topology between the targets; a characteristic matching mode of the graph is also designed, and a more robust similarity score can be obtained through a reasonable matching mode.
On the other hand, in order to explain the effect of the above scheme of the embodiment of the invention, the graph similarity model is applied to the existing multi-target tracking method based on data association, the similarity model therein is replaced, and the validity of the graph similarity model is verified through experiments.
Experiments were performed on MOTChallenge (https://motchallenge.net/) to analyze the merits and positive effects of the graph similarity model. The data sets used include MOT16 and MOT17, with the following evaluation indices:
1) Multi-Object Tracking Accuracy (MOTA): the higher, the better;
2) Multi-Object Tracking Precision (MOTP): the higher, the better;
3) ID F1 score (IDF1), the frequency with which the same target is assigned the same ID: the higher, the better;
4) the number of Mostly Tracked targets (MT): the higher, the better;
5) the number of Mostly Lost targets (ML): the lower, the better;
6) the number of identity switches (IDS), i.e., times a target's ID changes: the lower, the better;
7) the number of track fragmentations (Frag), i.e., discontinuous target tracks: the lower, the better;
8) the number of missed targets (FN, false negatives): the lower, the better;
9) the number of false alarms (FP, false positives): the lower, the better.
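For reference, MOTA combines the FN, FP and IDS counts above into a single score (this is the standard MOTChallenge formula, not something specific to this patent):

```python
def mota(fn, fp, ids, num_gt):
    """Multi-Object Tracking Accuracy: 1 minus the normalized sum of
    false negatives, false positives and identity switches, where
    num_gt is the total number of ground-truth boxes over all frames."""
    return 1.0 - (fn + fp + ids) / num_gt

# e.g. 100 ground-truth boxes, 10 misses, 5 false alarms, 2 ID switches
score = mota(10, 5, 2, 100)
```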
1. Details of the experiment.
In the experiments, all picture blocks are scaled to a size of 64 × 128. The convolutional neural network used to extract appearance features is implemented based on ResNet-34 (Deep residual learning for image recognition, CVPR 2016): the last FC layer is removed to obtain 2 × 4 × 256 features, the features are flattened into a 2048-dimensional vector, and the vector is finally fed into an FC layer to obtain a 256-dimensional appearance feature vector. The relative position feature output by the relative position encoder (RPE) is also 256-dimensional.
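A minimal shape-level sketch of the described feature pipeline, with random weights standing in for the trained ResNet-34 backbone and FC layer (all weights and the variable names are placeholders, not the patent's actual parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_appearance(feat_2x4x256, w_fc):
    """Flatten the 2 x 4 x 256 backbone output into a 2048-d vector,
    then project it to a 256-d appearance feature with an FC layer."""
    v = feat_2x4x256.reshape(-1)        # (2048,)
    return w_fc @ v                      # (256,)

backbone_out = rng.normal(size=(2, 4, 256))  # stand-in for ResNet-34 features
w_fc = rng.normal(size=(256, 2048)) * 0.01   # stand-in FC weights
feat = extract_appearance(backbone_out, w_fc)
```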
The appearance convolutional neural network, the relative position encoder and the classifier are trained end to end for 30 epochs on the training set using a binary cross-entropy loss function. The inputs to the classifier during training are divided into positive and negative samples: when the anchors o_{t-1}^i and o_t^j of the two graphs are the same target and their matched neighbors are also the same targets, the input is a positive sample; all inputs not meeting this requirement are negative samples. The learning rate is initialized to 0.002 and halved every 10 epochs. In addition, online hard example mining is used to alleviate the imbalance between positive and negative samples.
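The stated schedule (initial learning rate 0.002, halved every 10 epochs over 30 epochs) can be written compactly as:

```python
def learning_rate(epoch, base=0.002, step=10):
    """Halve the base learning rate once every `step` epochs."""
    return base * 0.5 ** (epoch // step)

# epochs 0-9 -> 0.002, epochs 10-19 -> 0.001, epochs 20-29 -> 0.0005
rates = [learning_rate(e) for e in (0, 9, 10, 20, 29)]
```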
2. Ablation experiment
To verify the validity of the GSM (graph similarity model) of the invention, a baseline model, denoted Naive, is also designed. Naive has no RPE (relative position encoder), and its classifier classifies only according to the appearance features of targets (i.e., it determines whether two appearance features belong to the same target). Two trackers are further designed whose only difference is the similarity model used; both track by matching and associating targets in adjacent frames. The 7 videos in the MOT16 training set are further divided into a training subset and a validation subset: the validation subset comprises MOT16-09 and MOT16-10, and the remaining 5 videos constitute the training subset. The tracking results of the different models on the validation subset are shown in Table 1.
Table 1: tracking performance on verification subsets
In Table 1, the superscript of each model denotes the number K of neighbors used. The first two rows compare multi-target tracking performance without neighbors; the two models perform essentially the same. This is understandable: from the calculation formula of the relative position vector, the relative position of a target to itself is an 8-dimensional all-zero vector, which means that GSM uses only appearance features when the number of neighbors K is 0.
The middle four rows compare the effect of hard matching (subscript h) and soft matching (subscript s) on tracking performance; five neighbors are used. For the Naive model, the constructed graph has only nodes and no edges. It can be seen that both hard matching and soft matching have a negative effect on the Naive model, for the following reasons: (1) for two graphs with different anchors, if the same targets appear in both neighborhoods, either matching scheme increases the similarity of the two graphs; (2) for two graphs with different anchors, the negative effect of soft matching is not negligible because only appearance features, and no relative position features, are used. For the GSM model, when 5 neighbors are used, hard matching instead degrades tracking performance (compare GSM_h^5 with GSM^0 and GSM_s^5): when the anchors of the two graphs are the same target but their neighbors are not aligned, the similarity produced by hard matching is low. Comparing GSM_s^5 with GSM^0, the IDF1 of GSM_s^5 is 5.9% higher and its IDS is lower, indicating that GSM_s^5 assigns the same ID to the same target more frequently during tracking.
The last row shows the influence of retrieving lost targets on tracking performance (the retrieval-enabled variant of GSM_s^5 in Table 1). When retrieving targets, 64 candidate objects are sampled for each lost target. It can be seen that its FN is lower, indicating that some lost targets are retrieved; however, its FP is also higher, indicating that some retrievals fail.
To find a suitable value of K, many experiments were carried out using soft matching, as shown in Fig. 8. Overall, when K ≥ 5, MOTA improves slightly while IDF1 remains basically unchanged. The reason is that as K increases, when the anchors of the two graphs are the same target, the resulting similarity score becomes higher (more reliable); conversely, when the anchors of the two graphs are different targets, the resulting similarity score also becomes higher (less reliable). These positive and negative effects cancel each other. To trade off tracking performance against running time, K is set to 5. On the validation set, creating a directed graph and computing the similarity score between two graphs take 0.15 ms and 0.03 ms, respectively. After replacing the Naive model with the GSM model, the tracking speed drops from 93.7 FPS to 61.5 FPS.
Some tracking results are visualized in Fig. 9 (first row: results of Naive^0; second row: results of GSM^0; third row: results of GSM_s^5). In frame 27 there are two targets inside the dashed box, denoted target 2 on the right and target 8 on the left. In frame 29, target 2 is partially occluded by target 8, so target 2 is not detected; however, its position is still well estimated by the lost-target retrieval. In frame 49, target 2 is completely occluded by target 8: Naive^0 erroneously identifies target 8 as target 2, while GSM^0 and GSM_s^5 identify the target correctly (the rectangle on the right in the middle).
3. Results on MOTChallenge
The GSM model of the invention is applied to Tracktor (Tracking without bells and whistles, ICCV 2019), the best-performing published tracker at the time, and the resulting tracker is denoted GSM_Tracktor. In addition, GSM_Tracktor is compared with the algorithms MTDF (Multi-level cooperative fusion of GM-PHD filters for online multiple human tracking, IEEE Transactions on Multimedia 2019), STAM (Online multi-object tracking using CNN-based single object tracker with spatial-temporal attention mechanism, ICCV 2017), DMMOT, AMIR (Tracking the untrackable: learning to track multiple cues with long-term dependencies, ICCV 2017), STRN (Spatial-temporal relation networks for multi-object tracking, ICCV 2019), DMAN (Online multi-object tracking with dual matching attention networks, ECCV 2018), HAM-SADF (Online multi-object tracking with historical appearance matching and scene adaptive detection filtering, AVSS 2018), MOTDT (Real-time multiple people tracking with deeply learned candidate selection and person re-identification, ICME 2018) and FAMNet. The methods are tested on the MOT16 and MOT17 test sets; the test results are shown in Table 2.
Table 2: tracking performance of different tracking algorithms on MOTChalnge
On MOT16, GSM_Tracktor achieves the best tracking performance on all metrics except MOTP, FP and IDS, and ranks second on IDS (475), only slightly higher than the first place (473). Compared with Tracktor, GSM_Tracktor improves MOTA and IDF1 by 3.6% and 5.7% respectively, and also reduces IDS by 30.4%. On MOT17, GSM_Tracktor likewise achieves the best overall performance, obtaining better results than Tracktor on almost all metrics; in particular, the improvements on MOTA and IDF1 are 2.9% and 5.5% respectively, and IDS is reduced by more than 20%. The good results of GSM_Tracktor on IDS and IDF1 show that the GSM model has strong feature representation ability and that the computed similarity is more robust.
Through the above description of the embodiments, it will be clear to those skilled in the art that the above embodiments may be implemented by software, or by software plus a necessary general hardware platform. Based on this understanding, the technical solutions of the embodiments may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for enabling a computer device (such as a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (9)
1. A similarity calculation method for multi-target tracking, comprising:
for each target in each video frame, calculating the neighbors of the target, constructing a vertex set from the appearance features of the target and its neighbors, and calculating a directed edge set from the interrelations between targets, thereby constructing a directed graph; and
for adjacent video frames, performing matching calculation using the directed graphs of the targets in the two video frames to obtain a similarity calculation result.
2. The similarity calculation method for multi-target tracking according to claim 1, wherein the target set in the t-th frame is expressed as O_t = {o_t^i = (b_t^i, p_t^i)}_{i=1}^{I_t}, wherein the i-th target is represented as o_t^i; b_t^i represents the position box of the i-th target, whose four elements are the coordinates of its top-left corner and the width and height of the position box; p_t^i represents the picture block cropped from the t-th frame according to the position box; and I_t indicates the number of targets in the t-th frame.
3. The similarity calculation method for multi-target tracking according to claim 2, wherein K neighbors of each target are obtained according to the distances between targets, K being the total number of neighbors; for the t-th frame, with the i-th target o_t^i as the anchor, the target o_t^i serves as its own 0-th neighbor, and the target o_t^i and its neighbors form the set N_t^i = {o_t^{i,k}}_{k=0}^{K}, wherein o_t^{i,k} is the k-th neighbor of the target o_t^i.
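An illustrative sketch of this neighbor-set construction, taking the K targets whose box centers are nearest to the anchor (Euclidean center distance is an assumption here; the claim only says "distance between targets"):

```python
import numpy as np

def neighbor_set(boxes, i, K):
    """Return indices [i, n1, ..., nK]: the anchor as its own 0-th
    neighbor, followed by its K nearest targets by center distance.
    Boxes are (x, y, w, h)."""
    boxes = np.asarray(boxes, dtype=float)
    centers = boxes[:, :2] + boxes[:, 2:] / 2.0
    d = np.linalg.norm(centers - centers[i], axis=1)
    d[i] = np.inf                      # exclude the anchor itself
    nearest = np.argsort(d)[:K]
    return [i] + nearest.tolist()

boxes = [(0, 0, 2, 2), (10, 0, 2, 2), (1, 1, 2, 2), (50, 50, 2, 2)]
# anchor 0's two nearest neighbors are targets 2 and 1, in that order
```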
4. The similarity calculation method for multi-target tracking according to claim 3, wherein for the i-th target o_t^i in the t-th frame, the constructed directed graph is represented as G_t^i = (V_t^i, E_t^i), wherein the vertex set V_t^i is defined as:
V_t^i = { f_t^{i,k} = φ_ACNN(p_t^{i,k}) }_{k=0}^{K}
wherein f_t^{i,k} represents the appearance feature of o_t^{i,k}, and φ_ACNN(·) represents the forward function of the convolutional neural network used to extract appearance features.
For the directed edge set E_t^i, the relative position vector e_t^{i,k} between the anchor o_t^i and its k-th neighbor o_t^{i,k} is first computed as:
e_t^{i,k} = φ_RP(b_t^i, b_t^{i,k}; w_t, h_t)
wherein w_t and h_t are the width and height of the t-th frame, and φ_RP(·) is a function that computes the relative position between targets based on their position boxes.
A relative position encoder is used to transform the relative position vector e_t^{i,k} into r_t^{i,k}, thereby obtaining the directed edge set:
E_t^i = { r_t^{i,k} = φ_RPE(e_t^{i,k}) }_{k=0}^{K}
wherein φ_RPE(·) is the relative position encoder.
5. The similarity calculation method for multi-target tracking according to claim 4, wherein performing hard matching using the directed graphs of targets in two video frames comprises:
given the directed graphs of two objects for adjacent video frames, frame t-1 and frame tAndfirst, a similarity matrix is calculatedThe elements in the k-th row and k' -th column of the matrix are calculated as follows:
in the formula (I), the compound is shown in the specification,representing element subtraction between feature vectors, | · non-2Representing the squaring of elements in a pair vector, [, ]]Means that two vectors are spliced together, phiBC(. cndot.) represents the forward function of the two classifiers.
Finally, the hard-matching similarity is obtained from the aligned (diagonal) node pairs:
s_h(G_{t-1}^i, G_t^j) = (1/(K+1)) Σ_{k=0}^{K} S^{i,j}_{k,k}
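An illustrative sketch of the pairwise node similarity and the hard-matching score, under the reading that hard matching averages the diagonal entries of the similarity matrix (aligned node pairs). A logistic-style stub stands in for the trained binary classifier φ_BC:

```python
import numpy as np

def node_pair_feature(fa, fb, ra, rb):
    """Concatenate squared element-wise differences of appearance and
    relative-position features for one node pair."""
    return np.concatenate([(fa - fb) ** 2, (ra - rb) ** 2])

def similarity_matrix(F1, R1, F2, R2, classifier):
    """S[k, k'] = classifier score for node k of graph 1 vs node k' of
    graph 2; F*, R* have shape (K+1, d)."""
    n, m = len(F1), len(F2)
    S = np.empty((n, m))
    for k in range(n):
        for kp in range(m):
            S[k, kp] = classifier(node_pair_feature(F1[k], F2[kp], R1[k], R2[kp]))
    return S

def hard_match(S):
    """Hard matching: average the diagonal (k-th node to k-th node)."""
    return float(np.mean(np.diag(S)))

# stub classifier: score near 1 when the pair feature is small
clf = lambda x: float(np.exp(-x.sum()))

# identical graphs give the maximum hard-match score of 1.0
rng = np.random.default_rng(1)
F = rng.normal(size=(3, 4))   # K+1 = 3 nodes, 4-d appearance features
R = rng.normal(size=(3, 2))   # 2-d relative-position features
S = similarity_matrix(F, R, F, R, clf)
```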
6. The similarity calculation method for multi-target tracking according to claim 5, wherein performing soft matching using the directed graphs of the targets in the two video frames comprises:
on the basis of hard matching, firstly performing proximity alignment, and then calculating the similarity, which is expressed as:
in the formula (I), the compound is shown in the specification,is a similarity matrix Si,jRemoving the first row and the first column to obtain a matrix; phi is aLA(. cndot.) is a linear distribution function for completing task distribution and returning the maximum total similarity sum according to the input similarity matrix.
7. A multi-target tracking method is characterized in that the method of any one of claims 1 to 6 is applied to an existing multi-target tracking method based on data association to replace a similarity model in the existing multi-target tracking method.
8. The multi-target tracking method according to claim 7, further comprising: retrieving a target lost in the current frame by using information of the previous frame, with the following steps:
for the ith target in the t-1 frameIf lost at the t frame, then use the ith targetIts k-th neighborRelative position of each otherAndestimate the location frame of the ith target in the t-th frame
In the formula (I), the compound is shown in the specification,is indicative of phiRPInverse function of phiRP(-) is a function that calculates the relative position between the targets based on the location box;to representThe corresponding target in the t-th frame.
9. The multi-target tracking method according to claim 8, wherein after the final position box b̂_t^i of the i-th target in the t-th frame is obtained, several candidate boxes are sampled from a Gaussian distribution based on b̂_t^i;
for any sampled candidate box b_t^{i,c}, o_t^{i,c} = (b_t^{i,c}, p_t^{i,c}) represents a candidate target in the t-th frame, wherein p_t^{i,c} represents the picture block cropped from the t-th frame according to the position box b_t^{i,c}; a directed graph G_t^{i,c} is constructed, and the similarity between G_t^{i,c} and G_{t-1}^i is then obtained; and
the candidate target with the highest similarity score is selected from all candidate targets, and if the similarity score is greater than a set threshold, the candidate target with the highest score is taken as the tracking result of o_{t-1}^i in the t-th frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110695292.5A CN113409356A (en) | 2021-06-23 | 2021-06-23 | Similarity calculation method and multi-target tracking method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113409356A true CN113409356A (en) | 2021-09-17 |
Family
ID=77682492
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111882580A (en) * | 2020-07-17 | 2020-11-03 | 元神科技(杭州)有限公司 | Video multi-target tracking method and system |
EP3770854A1 (en) * | 2018-09-14 | 2021-01-27 | Tencent Technology (Shenzhen) Company Limited | Target tracking method, apparatus, medium, and device |
Non-Patent Citations (1)
Title |
---|
QIANKUN LIU ET AL.: "GSM: Graph Similarity Model for Multi-Object Tracking", Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20210917 |