CN116563345A - Multi-target tracking method based on point-surface matching association - Google Patents

Multi-target tracking method based on point-surface matching association

Info

Publication number
CN116563345A
Authority
CN
China
Prior art keywords
targets
target
video frame
confidence
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310600166.6A
Other languages
Chinese (zh)
Inventor
于纯妍
潘锰
宋梅萍
于浩洋
张强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University filed Critical Dalian Maritime University
Priority to CN202310600166.6A priority Critical patent/CN116563345A/en
Publication of CN116563345A publication Critical patent/CN116563345A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/62Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-target tracking method based on point-plane matching association. The method comprises: acquiring a video to be detected and serializing it; constructing a detector from a feature extraction network and sequentially performing target detection on the frames of the video frame sequence with the detector; acquiring the t-th and (t+1)-th video frames from the sequence and dividing the targets in the (t+1)-th frame into high-confidence targets and low-confidence targets; performing a first association matching on the high-confidence targets according to a target-box similarity perception algorithm, adding successfully matched targets to the target tracks of the t-th frame and initializing target tracks for the targets that fail to match; performing a second association matching on the low-confidence targets, adding successfully matched targets to the target tracks of the t-th frame and discarding the targets that fail to match; and traversing all video frames in sequence until all target tracks are acquired. The method improves the detection confidence of occluded targets while improving the accuracy of multi-target track association.

Description

Multi-target tracking method based on point-surface matching association
Technical Field
The invention relates to the field of multi-target tracking, in particular to a multi-target tracking method based on point-surface matching association.
Background
Visual object tracking is a widely used computer vision task. Unlike object classification and object detection, object tracking can be divided into two tasks according to the scenario: 1) single-object tracking (VOT); 2) multi-object tracking (MOT). Single-object tracking continuously marks, throughout the subsequent frames of a video sequence, an arbitrary target given in the first frame; multi-object tracking detects and tracks objects of known categories that may appear or disappear in a video sequence, and must assign an ID to each tracked object.
Early tracking schemes often tracked points of interest in space-time. Such trackers were comparatively simple, and were fast and stable in simple scenarios with slow linear motion and few targets. However, when encountering low-level cues such as image edge regions, the tracking effect degrades severely. With the development of computer vision, many detection-based object tracking algorithms have become popular. These methods mainly rely on a target detection model to detect targets and must perform data association between the targets of consecutive frames to connect the motion tracks across frames; this detect-then-track two-stage strategy is currently the mainstream multi-target tracking paradigm. Detection-based tracking mostly exploits the high performance of deep-learning-based target detectors, but the data association algorithms of the trackers remain slow and complex.
Disclosure of Invention
The invention provides a multi-target tracking method based on point-plane matching association, which aims to overcome the technical problems.
A multi-target tracking method based on point-plane matching association comprises the following steps,
step one, obtaining a video to be detected, carrying out serialization processing on the video to obtain a video frame sequence,
step two, constructing a detector according to the characteristic extraction network, sequentially carrying out target detection on video frames in the video frame sequence according to the detector and marking all detected targets,
step three, let t=1, initialize the target track according to the detected target in the first video frame, initialize the buffer pool, the buffer pool is used for storing the target track set,
step four, acquiring the t-th and (t+1)-th video frames from the video frame sequence and dividing the targets in the (t+1)-th video frame into high-confidence targets and low-confidence targets; carrying out a first association matching on the high-confidence targets according to a target-box similarity perception algorithm, adding the successfully matched targets into the target tracks of the t-th video frame, initializing target tracks for the targets that fail to match, and storing these tracks in the buffer pool; carrying out a second association matching on the low-confidence targets, adding the successfully matched targets into the target tracks of the t-th video frame and discarding the targets that fail to match; wherein the second association matching comprises: acquiring the low-confidence targets and taking them as the targets to be matched; modeling the targets to be matched and the targets in the t-th video frame with two-dimensional Gaussian distributions; acquiring the two-dimensional Gaussian distribution of each target after modeling; constructing a distance matrix according to the two-dimensional Gaussian distributions of all targets in the t-th and (t+1)-th video frames; normalizing the distance matrix to obtain the similarity matrix of the (t+1)-th video frame; carrying out similarity matching on each low-confidence target in the (t+1)-th video frame according to the similarity matrix; when a low-confidence target in the (t+1)-th video frame satisfies the similarity matching condition, adding it into the corresponding target track of the t-th video frame and storing the updated track in the buffer pool; and when it does not satisfy the matching condition, discarding it,
and step five, judging whether the t-th video frame is the last video frame in the video frame sequence; if so, acquiring all target tracks, otherwise, letting t = t+1 and re-executing step four.
Preferably, the constructing the detector according to the feature extraction network includes modifying the feature extraction network, the modifying includes using a deep aggregation network as a backbone network of the feature extraction network, the feature extraction network uses a deformable convolution network to perform feature extraction, uses a deep aggregation upsampling module to perform feature fusion, and adds a global context information extraction module to the feature extraction network, and uses the modified feature extraction network as the detector.
Preferably, said modeling of the target to be matched and the target in the t-th video frame by two-dimensional Gaussian distribution comprises modeling according to formulas (1) and (2):

$$f(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})=\frac{1}{2\pi\lvert\boldsymbol{\Sigma}\rvert^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\quad(1)$$

$$\boldsymbol{\mu}=[a,b]^{\mathrm{T}},\qquad\boldsymbol{\Sigma}=\operatorname{diag}\!\left(\frac{w^{2}}{4},\frac{h^{2}}{4}\right)\quad(2)$$

where (a, b) is the center coordinate of the target; w and h are the extents of the target box along the x-axis and y-axis (half-axes w/2 and h/2); and x, μ, Σ are, respectively, the coordinate vector (x, y), the mean vector, and the covariance matrix of the Gaussian distribution. The target thus follows a two-dimensional Gaussian distribution N(μ, Σ), denoted B ~ N(μ, Σ).
Preferably, the similarity matching of the low-confidence targets in the (t+1)-th video frame according to the similarity matrix comprises setting a similarity threshold and obtaining a similarity matrix S_nm, where n is the number of low-confidence targets in the (t+1)-th video frame and m is the number of target tracks of the t-th video frame; when S_ij does not satisfy the similarity threshold, the i-th target is unrelated to the j-th target track and S_ij is set to 0, and otherwise S_ij is not modified, where i denotes the i-th target in the (t+1)-th video frame, j denotes the j-th target track of the t-th video frame, i ≤ n, and j ≤ m; the updated S_nm is input to a greedy algorithm to obtain the target track j most related to the i-th target.
Preferably, the dividing of the targets in the (t+1)-th video frame into high-confidence targets and low-confidence targets comprises setting a tracking threshold, obtaining all targets in the (t+1)-th video frame, and calculating the confidence according to the pixel values of the targets; when the confidence of a target exceeds the tracking threshold, the target is taken as a high-confidence target, otherwise as a low-confidence target.
The invention provides a multi-target tracking method based on point-plane matching association, which predicts the track of a target by measuring the normalized Gaussian similarity distance between targets and target tracks in two adjacent video frames, improving the accuracy of multi-target track association. A detector is constructed that extracts global context information from the low-level features ahead of the feature extraction network, enhancing the expressiveness of key texture information in the low-level features and improving the detection confidence of occluded targets. A secondary association matching and caching strategy for multi-target tracks is realized: occluded or unmatched targets undergo a second matching based on the PS association algorithm, and a caching mechanism is applied to interrupted tracks, forming long-term, stable multi-target track association.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it will be obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a one-stage point-to-face matching association of the present invention;
FIG. 3 is a flow chart of an embodiment of the present invention;
FIG. 4 is a graph of multi-objective tracking results in an embodiment of the invention;
FIG. 5(a) is a graph analyzing the MOTA and IDF1 metrics under different values of the NWS similarity constant C;
FIG. 5(b) is a graph analyzing the IDs metric under different values of the NWS similarity constant C;
FIG. 6(a) is a comparative test on the MOTA metric under different values of the buffer frame number N;
FIG. 6(b) is a comparative test on the IDF1 metric under different values of the buffer frame number N.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
FIG. 1 is a flowchart of the method of the present invention. As shown in FIG. 1, the method of this embodiment may include:
step one, obtaining a video to be detected, carrying out serialization processing on the video to obtain a video frame sequence,
step two, constructing a detector according to the characteristic extraction network, sequentially carrying out target detection on video frames in the video frame sequence according to the detector and marking all detected targets,
step three, let t=1, initialize the target track according to the detected target in the first video frame, initialize the buffer pool, the buffer pool is used for storing the target track set,
step four, acquiring the t-th and (t+1)-th video frames from the video frame sequence and dividing the targets in the (t+1)-th video frame into high-confidence targets and low-confidence targets; carrying out a first association matching on the high-confidence targets according to a target-box similarity perception algorithm, adding the successfully matched targets into the target tracks of the t-th video frame, initializing target tracks for the targets that fail to match, and storing these tracks in the buffer pool; carrying out a second association matching on the low-confidence targets, adding the successfully matched targets into the target tracks of the t-th video frame and discarding the targets that fail to match; wherein the second association matching comprises: acquiring the low-confidence targets and taking them as the targets to be matched; modeling the targets to be matched and the targets in the t-th video frame with two-dimensional Gaussian distributions; acquiring the two-dimensional Gaussian distribution of each target after modeling; constructing a distance matrix according to the two-dimensional Gaussian distributions of all targets in the t-th and (t+1)-th video frames; normalizing the distance matrix to obtain the similarity matrix of the (t+1)-th video frame; carrying out similarity matching on each low-confidence target in the (t+1)-th video frame according to the similarity matrix; when a low-confidence target in the (t+1)-th video frame satisfies the similarity matching condition, adding it into the corresponding target track of the t-th video frame and storing the updated track in the buffer pool; and when it does not satisfy the matching condition, discarding it,
and step five, judging whether the t-th video frame is the last video frame in the video frame sequence; if so, acquiring all target tracks, otherwise, letting t = t+1 and re-executing step four.
Based on this scheme, the track of a target is predicted by measuring the normalized Gaussian Wasserstein similarity between targets and target tracks in two adjacent video frames, improving the accuracy of multi-target track association. A detector is constructed that extracts global context information from the low-level features ahead of the feature extraction network, enhancing the expressiveness of key texture information in the low-level features and improving the detection confidence of occluded targets. A secondary association matching and caching strategy for multi-target tracks is realized: occluded or unmatched targets undergo a second matching based on the PS association algorithm, and a caching mechanism is applied to interrupted tracks, forming long-term, stable multi-target track association.
Step one, obtaining a video to be detected, carrying out serialization processing on the video to obtain a video frame sequence,
step two, a detector is constructed from a feature extraction network. Global attention information of the low-level features is extracted before the feature extraction network, and the features of interest are optimized using rich texture and channel information. Constructing the detector according to the feature extraction network comprises modifying the network: a deep aggregation network is used as the backbone of the feature extraction network, the feature extraction network adopts a deformable convolution network for feature extraction and a deep aggregation up-sampling module for feature fusion, and a global context information extraction module (GCE) is added to the feature extraction network; the modified feature extraction network is used as the detector.
the feature extraction network adopts a deep aggregation network DLA as a backbone network to extract multi-level features. The DLA backbone network adds more jump connections similar to the feature pyramid network FPN between the low-level features and the high-level features, adopts an iterative deep aggregation up-sampling module IDAUp in the fusion part, and replaces the common convolution network with a deformable convolution network DCN capable of dynamically adjusting the receptive field according to the target size to fuse the multi-scale features extracted by the DLA.
Input: the feature extraction network accepts an input image of size 3 × H × W; the total downsampling rate of the DLA backbone is 32.
Output: after the IDAUp fusion and up-sampling network, a 4× downsampled feature map of size 64 × H/4 × W/4 is output.
The attention mechanism adopted by the GCE comprises a spatial attention part and a channel attention part. Attending to the points of interest through both parts simultaneously allows the feature extraction network to focus on the more important features of the input feature map F, helping the underlying network layers better process the spatial and channel information of the input data and extract the relevant features.
The formula for the GCE is given in equation (1):

$$A_s=\sigma\!\left(W\ast[\mathrm{MP}(F);\mathrm{AP}(F)]\right),\qquad A_c=\sigma\!\left(\mathrm{MLP}(\mathrm{GMP}(F))+\mathrm{MLP}(\mathrm{GAP}(F))\right),\qquad F'=A_c\otimes A_s\otimes F\quad(1)$$

where MP and AP are the max-pooling and average-pooling layers, W is the convolution layer weight, σ is the Sigmoid activation function, GMP and GAP are the global max-pooling and global average-pooling layers, MLP is a multi-layer perceptron, ⊗ denotes the element-wise product, A_s denotes spatial attention, and A_c denotes channel attention.
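For illustration only, the following PyTorch module is a minimal sketch of a GCE-style block combining the channel and spatial attention terms of equation (1). The module structure, the channel reduction ratio, the 7×7 spatial-attention kernel, and the order of applying A_c before A_s are assumptions not fixed by the patent.

```python
import torch
import torch.nn as nn

class GCE(nn.Module):
    """Sketch of a global context extraction block: channel attention A_c
    followed by spatial attention A_s, per equation (1) (structure assumed)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP for channel attention over global max/avg pooled descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # 7x7 convolution for spatial attention over channel-wise max/avg maps
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        # A_c = sigmoid(MLP(GMP(F)) + MLP(GAP(F)))
        gmp = torch.amax(f, dim=(2, 3))            # B x C
        gap = torch.mean(f, dim=(2, 3))            # B x C
        a_c = self.sigmoid(self.mlp(gmp) + self.mlp(gap)).view(b, c, 1, 1)
        f = f * a_c                                # element-wise product
        # A_s = sigmoid(W * [MP(F); AP(F)]) along the channel axis
        mp = torch.amax(f, dim=1, keepdim=True)    # B x 1 x H x W
        ap = torch.mean(f, dim=1, keepdim=True)    # B x 1 x H x W
        a_s = self.sigmoid(self.conv(torch.cat([mp, ap], dim=1)))
        return f * a_s
```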
Target detection is carried out sequentially on the video frames of the sequence with the detector, and all detected targets are marked. For the t-th frame of the video, the detector outputs target box information size = (a, b, w, h) for every target, where (a, b) is the target center coordinate and (w, h) is the width and height of the target box;
step three, let t = 1 and initialize a target track for each target detected in the first video frame. An initialized target track is a set that stores, under the same target number, the target's position information in the video frames and the associated frame information; specifically, it comprises a target number, the target center coordinate, the target width and height, a cache duration, and a frame sequence number set. The cache duration is how long the target has appeared in the video, and the frame sequence number set contains the sequence numbers of all video frames in which the target appears. A buffer pool is also initialized; the buffer pool stores the set of target tracks (an illustrative record structure is sketched below),
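As an illustration, a target track as described above can be held in a small record structure. The following is a minimal sketch under stated assumptions: the field names (track_id, center, size, buffer_age, frame_ids, active) are hypothetical, not terms fixed by the patent.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Track:
    """One target track as described in step three (field names assumed)."""
    track_id: int                      # target number (ID)
    center: Tuple[float, float]        # target center coordinate (a, b)
    size: Tuple[float, float]          # target box width and height (w, h)
    buffer_age: int = 0                # cache duration: frames since last match
    frame_ids: List[int] = field(default_factory=list)  # frames where it appears
    active: bool = True                # activity state
```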
step four, acquiring the t-th and (t+1)-th video frames from the video frame sequence and dividing the targets in the (t+1)-th video frame into high-confidence targets and low-confidence targets. The division comprises setting a tracking threshold, acquiring all targets in the (t+1)-th video frame, and calculating each target's confidence from its pixel values; when a target's confidence exceeds the tracking threshold it is taken as a high-confidence target, otherwise as a low-confidence target,
performing first-time association matching on the high-confidence targets according to a target frame similarity sensing algorithm, adding the successfully matched targets into target tracks of the t-th video frame, respectively initializing target tracks of the targets which are failed to be matched, and storing the target tracks in a cache pool, wherein the performing first-time association matching according to the target frame similarity sensing algorithm comprises calculating normalized Gaussian Wasserstein similarity between all the high-confidence targets and all the target track frames of the previous frame, performing one-to-one matching according to the maximum value of similarity scores between the high-confidence targets, and assigning the same target numbers for the matched high-confidence targets, wherein the high-confidence targets which are not matched are assigned with new target numbers.
A second association matching is carried out on the low-confidence targets; the successfully matched targets are added into the target tracks of the t-th video frame, and the targets that fail to match are discarded. The second association matching comprises acquiring the low-confidence targets, taking them as the targets to be matched, and modeling the targets to be matched and the targets in the t-th video frame with two-dimensional Gaussian distributions.

Modeling the target to be matched and the target in the t-th video frame with two-dimensional Gaussian distributions comprises modeling according to formulas (2) and (3):

$$f(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})=\frac{1}{2\pi\lvert\boldsymbol{\Sigma}\rvert^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\quad(2)$$

$$\boldsymbol{\mu}=[a,b]^{\mathrm{T}},\qquad\boldsymbol{\Sigma}=\operatorname{diag}\!\left(\frac{w^{2}}{4},\frac{h^{2}}{4}\right)\quad(3)$$

where (a, b) is the center coordinate of the target; w and h are the extents of the target box along the x-axis and y-axis (half-axes w/2 and h/2); and x, μ, Σ are, respectively, the coordinate vector (x, y), the mean vector, and the covariance matrix of the Gaussian distribution. The target thus follows a two-dimensional Gaussian distribution N(μ, Σ), denoted B ~ N(μ, Σ). Remodeling each detected target with a two-dimensional Gaussian distribution yields the probability density of each target; the distribution weight (i.e., the two-dimensional Gaussian probability) of the foreground pixels decreases gradually from the target center toward the edge of the target box.
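As an illustrative sketch of this modeling step, under the assumption that target boxes are given as (a, b, w, h) tuples (the function name is hypothetical):

```python
import numpy as np

def box_to_gaussian(a: float, b: float, w: float, h: float):
    """Model a target box (a, b, w, h) as a 2-D Gaussian N(mu, Sigma)
    per formulas (2)-(3): mu = [a, b], Sigma = diag(w^2/4, h^2/4)."""
    mu = np.array([a, b], dtype=float)
    sigma = np.diag([w ** 2 / 4.0, h ** 2 / 4.0])
    return mu, sigma
```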
After modeling, the two-dimensional Gaussian distribution of each target is acquired, and a distance matrix is constructed from the two-dimensional Gaussian distributions of all targets in the t-th and (t+1)-th video frames.
Constructing the distance matrix specifically comprises the following.
the two-dimensional Gaussian distribution of the current frame target 1 is N 111 ) The two-dimensional gaussian distribution of the previous frame object 2 is N 222 ) The Wasserstein distance between the two is described as equation (4).
Simplification can yield equation (5):
wherein I II 2 Representing 2-normal form, I p Representing the p-norm. According to the above formula, bring into the target frame B 1 =(a 1 ,b 1 ,w 1 ,h 1 ),μ 1 =[a 1 ,b 1 ],Σ 1 =diag[w 1 2 /4,h 1 2 /4]Target frame B 2 =(a 2 ,b 2 ,w 2 ,h 2 ),μ 2 =[a 2 ,b 2 ],Σ 2 =diag[w 2 2 /4,h 2 2 /4]Obtaining a formula (6):
where W represents the Wasserstein distance matrix. The adjacent video W matrix is as in equation (7):
where m represents the number of target tracks of the previous frame (including all detected targets and non-matched targets, i.e. all tracks in the buffer pool), n represents the number of targets of the current frame, and d represents the wasperstein distance between a certain target track of the previous frame and the detected target of the current frame.
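A minimal NumPy sketch of equations (6)-(7), again assuming boxes are (a, b, w, h) tuples (function and variable names are illustrative, not from the patent):

```python
import numpy as np

def wasserstein_sq(box1, box2):
    """Squared 2-Wasserstein distance between the Gaussians of two target
    boxes, in the closed form of equation (6); boxes are (a, b, w, h)."""
    a1, b1, w1, h1 = box1
    a2, b2, w2, h2 = box2
    p1 = np.array([a1, b1, w1 / 2.0, h1 / 2.0])
    p2 = np.array([a2, b2, w2 / 2.0, h2 / 2.0])
    return float(np.sum((p1 - p2) ** 2))

def distance_matrix(detections, tracks):
    """W matrix of equation (7): n rows for current-frame detections,
    m columns for previous-frame tracks (including buffered tracks)."""
    return np.array([[wasserstein_sq(d, t) for t in tracks] for d in detections])
```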
The distance matrix is normalized to obtain a similarity matrix of the (t+1) th video frame,
the construction of the normalized similarity matrix specifically comprises the construction of a measurement method for evaluating the similarity between targets based on the Wasserstein distance of Gaussian distribution, which is called a point-plane matching association algorithm and is marked as a PS algorithm,
because the value range is not in the range of [0,1], W cannot be directly used as a similarity measurement matrix for measuring the two-dimensional Gaussian distribution of the target frame. However, the normalized gaussian similarity matrix, abbreviated as normalized waserstein similarity (Normalized Wasserstein Similarity, NWS), may be obtained by normalizing W in the form of a natural exponent, as in equation (8).
Where C is a constant, which is primarily related to the target size of the dataset, a larger value should be given to the constant C as the pixel area of the target in the image increases. NWS matrix as in equation (9):
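Continuing the sketch above, equations (8)-(9) amount to one element-wise operation; the default value of C below is a placeholder, since the patent only states that C grows with the pixel area of the targets:

```python
import numpy as np

def nws_matrix(w_matrix, c: float = 12.0):
    """NWS matrix of equations (8)-(9): exp(-sqrt(W)/C) applied element-wise.
    The default C is an assumption; it is dataset-dependent."""
    return np.exp(-np.sqrt(w_matrix) / c)
```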
respectively carrying out similarity matching on low confidence coefficient targets in the (t+1) th video frame according to the similarity matrixAs shown in fig. 2, the performing similarity matching on the low confidence targets in the (t+1) th video frame according to the similarity matrix respectively includes setting a similarity threshold to obtain a similarity matrix S nm N represents the number of targets in the (t+1) th video frame, m represents the number of target tracks of the (t) th video frame, when S ij When the similarity threshold is not satisfied, the ith target is not related to the jth item target track, and S is ij Set to 0, otherwise, not modify S ij I represents the ith target in the (t+1) th video frame, j represents the jth item mark track in the (t) th video frame, i is less than or equal to n, j is less than or equal to m, and S is updated nm Inputting the target tracks into a greedy algorithm to obtain a j-th target track most relevant to an i-th target, specifically, measuring the similarity between target frames by using a PS algorithm based on normalized Wasserstein similarity in the data association process, further converting the W matrix into an NWS matrix capable of carrying out similarity measurement by calculating a W distance matrix of all target tracks and all detected targets of the current frame, and giving a similarity threshold sigma NWS Removing less than sigma NWS And (3) finding the basis of the track association of all the matching results with the highest similarity in the NWS matrix through a greedy algorithm.
When a low-confidence target in the (t+1)-th video frame satisfies the similarity matching condition, it is added into the corresponding target track of the t-th video frame and the updated track is stored in the buffer pool; when it does not satisfy the matching condition, the low-confidence target is discarded,
and step five, judging whether the t-th video frame is the last video frame in the video frame sequence; if so, acquiring all target tracks, otherwise, letting t = t+1 and re-executing step four.
In this embodiment, all detected targets are divided into high-confidence targets and low-confidence targets according to a detection threshold; the high-score targets are matched against the tracks in the buffer pool, and a second association matching is carried out between the unmatched target tracks and the low-score targets; the tracks that remain unmatched at the end are retained in the buffer pool, as shown in FIG. 3. Specifically:
S51, two-dimensional Gaussian distribution similarity measurement of target boxes
The detection module of the detector outputs a Sigmoid-normalized heat map in which each pixel value represents a target confidence. Targets whose confidence is greater than the detection threshold θ_det are preserved; all other confidences, below θ_det, are set to 0.
When associating target tracks, a tracking threshold θ_track ∈ (θ_det, 1) is set, and the center positions and bounding-box information of the high-confidence targets whose confidence lies in the range [θ_track, 1) are modeled as two-dimensional Gaussian distributions.
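A minimal sketch of this two-threshold split; the detection format (a dict with a "score" key) and the threshold values are assumptions, since the patent only requires θ_track ∈ (θ_det, 1):

```python
def partition_detections(detections, theta_det: float = 0.3, theta_track: float = 0.6):
    """Partition raw detections by confidence: scores below theta_det are
    discarded (set to 0 in the heat map), scores in [theta_det, theta_track)
    form the low-confidence set, and scores in [theta_track, 1) form the
    high-confidence set."""
    high = [d for d in detections if d["score"] >= theta_track]
    low = [d for d in detections if theta_det <= d["score"] < theta_track]
    return high, low
```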
S52, correlation matching of high confidence targets
For the first frame, the track is initialized for all detected targets.
For subsequent frames, the target-box similarity perception algorithm is used to calculate the normalized Gaussian Wasserstein similarity between all prediction boxes and the track boxes of the previous frame. One-to-one matching is carried out according to the maximum similarity score: matched prediction boxes are assigned the same ID, unmatched prediction boxes are given new IDs and generate new tracks, and unmatched tracks are retained.
S53, second matching of unmatched tracks
The unmatched tracks are obtained and matched, using the PS algorithm, against all prediction boxes whose confidence lies in the interval [θ_det, θ_track). Prediction boxes that remain unmatched are then cleared, and the remaining unmatched historical tracks are retained.
S54, track-state caching mechanism
All track states (including the target center coordinates, size information, cache duration, and activity state) are updated and added to the buffer pool. The cache duration of matched tracks is reset and they are marked active; the cache duration of unmatched tracks is increased and they are marked inactive.
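A sketch of the S54 caching mechanism, reusing the Track fields assumed in the step-three sketch above (max_age plays the role of the buffer length N, set to 30 frames in the experiments below):

```python
def update_buffer_pool(buffer_pool, matched_tracks, unmatched_tracks, max_age: int = 30):
    """Matched tracks reset their cache duration and are marked active;
    unmatched tracks age and are marked inactive."""
    for trk in matched_tracks:
        trk.buffer_age = 0
        trk.active = True
    for trk in unmatched_tracks:
        trk.buffer_age += 1
        trk.active = False
    # evict tracks that have stayed unmatched longer than the buffer length
    return [t for t in buffer_pool if t.buffer_age <= max_age]
```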
The present invention was validated on the MOT17 dataset provided by MOT Challenge. The specific procedure is as follows.
A. Data source: the multi-target tracking dataset is the MOT17 dataset provided by MOT Challenge. It contains 7 training sequences and 7 test sequences; each sequence provides detections from 3 detectors (DPM, Faster R-CNN, and SDP), giving 21 training sequences and 21 test sequences in total. The video frame rate is 25-30 frames per second.
Table 1 shows the MOT17 dataset video sequence names and corresponding image sizes and sequence lengths:
TABLE 1
B. For this dataset, the first half of each training sequence is taken as the training set and the second half is taken as the validation set.
C. This section mainly performs ablation experiments on the validation set of the MOT17 dataset for the global context information extraction module GCE and for the PS association algorithm with the secondary association matching strategy and caching mechanism, as shown in Table 2. The target detection process is shown in FIG. 4. MOT17_half denotes pre-training on the CrowdHuman dataset, training the model on the MOT17 half training set, and finally evaluating on the MOT17 validation set; Only CrowdH denotes evaluating on the MOT17 validation set using only model weights trained on the CrowdHuman data; Scratch denotes training only on the MOT17 half training set and testing on the MOT17 validation set.
Experiments show that the proposed GCE module and PS association algorithm significantly improve the one-stage CenterTrack target tracking model. The Multiple Object Tracking Accuracy (MOTA) increased by 1.6% and 3.2% in the MOT17_half and Only CrowdH experiments respectively, and the Identified Detections F1 score (IDF1) increased by 6.7% and 5.2% respectively. The curves of the influence of C on the different metrics are shown in FIGS. 5(a) and 5(b), and the comparison of CenterTrack and PSTrack on the MOTA metric under different N is shown in FIGS. 6(a) and 6(b). In the Scratch experiment, because the feature extraction network of CenterTrack carries ImageNet pre-trained weights, those weights cannot be fully loaded once the GCE module is added; the experimental result with the GCE module is therefore not as good as CenterTrack's, but the PS association algorithm still shows strong performance. Overall, the proposed model outperforms CenterTrack. Furthermore, for a fairer comparison, a comparative test of the tracker was conducted under the FRCNN public detector provided by the MOT17 dataset, as shown in Table 3. The experiments show that the PS association algorithm performs better than the CenterTrack center-point matching algorithm.
TABLE 2
TABLE 3
D. The experiment sets the track buffer length N to 30 frames. The evaluation results on the MOT17 test set are shown in Table 4: using only the PS association algorithm improves the MOTA metric by 0.6% and the HOTA metric by 0.4%, with good performance also on metrics such as FN, ID Sw., and Recall. For PSTrack with the GCE module, the final performance is lower than PSTrack without it, because the ImageNet pre-trained weights cannot be fully loaded during training; the final tracking performance is nevertheless still stronger than CenterTrack.
TABLE 4
The overall beneficial effects are as follows:
The invention provides a multi-target tracking method based on point-plane matching association, which predicts the track of a target by measuring the normalized Gaussian Wasserstein similarity between targets and target tracks in two adjacent video frames, improving the accuracy of multi-target track association. A detector is constructed that extracts global context information from the low-level features ahead of the feature extraction network, enhancing the expressiveness of key texture information in the low-level features and improving the detection confidence of occluded targets. A secondary association matching and caching strategy for multi-target tracks is realized: occluded or unmatched targets undergo a second matching based on the PS association algorithm, and a caching mechanism is applied to interrupted tracks, forming long-term, stable multi-target track association.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (5)

1. A multi-target tracking method based on point-plane matching association is characterized by comprising the following steps of,
step one, obtaining a video to be detected, carrying out serialization processing on the video to obtain a video frame sequence,
step two, constructing a detector according to the characteristic extraction network, sequentially carrying out target detection on video frames in the video frame sequence according to the detector and marking all detected targets,
step three, let t=1, initialize the target track according to the detected target in the first video frame, initialize the buffer pool, the buffer pool is used for storing the target track set,
step four, acquiring the t-th and (t+1)-th video frames from the video frame sequence and dividing the targets in the (t+1)-th video frame into high-confidence targets and low-confidence targets; carrying out a first association matching on the high-confidence targets according to a target-box similarity perception algorithm, adding the successfully matched targets into the target tracks of the t-th video frame, initializing target tracks for the targets that fail to match, and storing these tracks in the buffer pool; carrying out a second association matching on the low-confidence targets, adding the successfully matched targets into the target tracks of the t-th video frame and discarding the targets that fail to match; wherein the second association matching comprises: acquiring the low-confidence targets and taking them as the targets to be matched; modeling the targets to be matched and the targets in the t-th video frame with two-dimensional Gaussian distributions; acquiring the two-dimensional Gaussian distribution of each target after modeling; constructing a distance matrix according to the two-dimensional Gaussian distributions of all targets in the t-th and (t+1)-th video frames; normalizing the distance matrix to obtain the similarity matrix of the (t+1)-th video frame; carrying out similarity matching on each low-confidence target in the (t+1)-th video frame according to the similarity matrix; when a low-confidence target in the (t+1)-th video frame satisfies the similarity matching condition, adding it into the corresponding target track of the t-th video frame and storing the updated track in the buffer pool; and when it does not satisfy the matching condition, discarding it,
and step five, judging whether the t-th video frame is the last video frame in the video frame sequence; if so, acquiring all target tracks, otherwise, letting t = t+1 and re-executing step four.
2. The multi-target tracking method based on point-plane matching association according to claim 1, wherein constructing the detector according to the feature extraction network comprises modifying the feature extraction network, the modification comprising using a deep aggregation network as the backbone of the feature extraction network, the feature extraction network using a deformable convolution network for feature extraction and a deep aggregation up-sampling module for feature fusion, and adding a global context information extraction module to the feature extraction network; the modified feature extraction network is used as the detector.
3. The method of claim 1, wherein modeling the target to be matched and the target in the t-th video frame by two-dimensional Gaussian distribution comprises modeling according to formulas (1) and (2):

$$f(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})=\frac{1}{2\pi\lvert\boldsymbol{\Sigma}\rvert^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\quad(1)$$

$$\boldsymbol{\mu}=[a,b]^{\mathrm{T}},\qquad\boldsymbol{\Sigma}=\operatorname{diag}\!\left(\frac{w^{2}}{4},\frac{h^{2}}{4}\right)\quad(2)$$

where (a, b) is the center coordinate of the target; w and h are the extents of the target box along the x-axis and y-axis (half-axes w/2 and h/2); and x, μ, Σ are, respectively, the coordinate vector (x, y), the mean vector, and the covariance matrix of the Gaussian distribution. The target thus follows a two-dimensional Gaussian distribution N(μ, Σ), denoted B ~ N(μ, Σ).
4. The multi-target tracking method based on point-plane matching association according to claim 1, wherein performing similarity matching on the low-confidence targets in the (t+1)-th video frame according to the similarity matrix comprises setting a similarity threshold and obtaining a similarity matrix S_nm, where n is the number of low-confidence targets in the (t+1)-th video frame and m is the number of target tracks of the t-th video frame; when S_ij does not satisfy the similarity threshold, the i-th target is unrelated to the j-th target track and S_ij is set to 0, and otherwise S_ij is not modified, where i denotes the i-th target in the (t+1)-th video frame, j denotes the j-th target track of the t-th video frame, i ≤ n, and j ≤ m; and the updated S_nm is input to a greedy algorithm to obtain the target track j most related to the i-th target.
5. The method of claim 1, wherein dividing the targets in the (t+1)-th video frame into high-confidence targets and low-confidence targets comprises setting a tracking threshold, acquiring all targets in the (t+1)-th video frame, and calculating the confidence according to the pixel values of the targets; when the confidence of a target exceeds the tracking threshold, the target is taken as a high-confidence target, and otherwise as a low-confidence target.
CN202310600166.6A 2023-05-25 2023-05-25 Multi-target tracking method based on point-surface matching association Pending CN116563345A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310600166.6A CN116563345A (en) 2023-05-25 2023-05-25 Multi-target tracking method based on point-surface matching association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310600166.6A CN116563345A (en) 2023-05-25 2023-05-25 Multi-target tracking method based on point-surface matching association

Publications (1)

Publication Number Publication Date
CN116563345A true CN116563345A (en) 2023-08-08

Family

ID=87487868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310600166.6A Pending CN116563345A (en) 2023-05-25 2023-05-25 Multi-target tracking method based on point-surface matching association

Country Status (1)

Country Link
CN (1) CN116563345A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173221A (en) * 2023-09-19 2023-12-05 浙江大学 Multi-target tracking method based on authenticity grading and occlusion recovery
CN117173221B (en) * 2023-09-19 2024-04-19 浙江大学 Multi-target tracking method based on authenticity grading and occlusion recovery

Similar Documents

Publication Publication Date Title
Hassaballah et al. Vehicle detection and tracking in adverse weather using a deep learning framework
CN111401201B (en) Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN111696128B (en) High-speed multi-target detection tracking and target image optimization method and storage medium
CN108447080B (en) Target tracking method, system and storage medium based on hierarchical data association and convolutional neural network
Lei et al. Region-enhanced convolutional neural network for object detection in remote sensing images
CN108520203B (en) Multi-target feature extraction method based on fusion of self-adaptive multi-peripheral frame and cross pooling feature
CN110942471A (en) Long-term target tracking method based on space-time constraint
CN111523553A (en) Central point network multi-target detection method based on similarity matrix
CN112738470B (en) Method for detecting parking in highway tunnel
CN112052802A (en) Front vehicle behavior identification method based on machine vision
CN104778699B (en) A kind of tracking of self adaptation characteristics of objects
CN114898403A (en) Pedestrian multi-target tracking method based on Attention-JDE network
CN116563345A (en) Multi-target tracking method based on point-surface matching association
Wang et al. MLFFNet: Multilevel feature fusion network for object detection in sonar images
CN113963333B (en) Traffic sign board detection method based on improved YOLOF model
Wu et al. Single shot multibox detector for vehicles and pedestrians detection and classification
CN112800967B (en) Posture-driven shielded pedestrian re-recognition method
CN116883457B (en) Light multi-target tracking method based on detection tracking joint network and mixed density network
Walia et al. A novel approach of multi-stage tracking for precise localization of target in video sequences
Sharma et al. HistoNet: Predicting size histograms of object instances
CN110147768B (en) Target tracking method and device
CN114627339B (en) Intelligent recognition tracking method and storage medium for cross border personnel in dense jungle area
CN113642520B (en) Double-task pedestrian detection method with head information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination