CN116563345A - Multi-target tracking method based on point-surface matching association - Google Patents

Multi-target tracking method based on point-surface matching association

Info

Publication number
CN116563345A
Authority
CN
China
Prior art keywords
targets
target
video frame
confidence
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310600166.6A
Other languages
Chinese (zh)
Inventor
于纯妍
潘锰
宋梅萍
于浩洋
张强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University filed Critical Dalian Maritime University
Priority to CN202310600166.6A priority Critical patent/CN116563345A/en
Publication of CN116563345A publication Critical patent/CN116563345A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/62Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-target tracking method based on point-plane matching association. The method comprises: acquiring a video to be detected and serializing it; constructing a detector from a feature extraction network and sequentially performing target detection on the frames of the video frame sequence with the detector; acquiring the t-th and (t+1)-th video frames from the sequence and dividing the targets in the (t+1)-th frame into high-confidence targets and low-confidence targets; performing a first association matching on the high-confidence targets according to a target-box similarity perception algorithm, adding successfully matched targets to the target tracks of the t-th frame and initializing target tracks for the targets that fail to match; performing a second association matching on the low-confidence targets, adding successfully matched targets to the target tracks of the t-th frame and discarding the targets that fail to match; and traversing all video frames in sequence until all target tracks are acquired. The method improves the detection confidence of occluded targets while improving the accuracy of multi-target track association.

Description

Multi-target tracking method based on point-surface matching association
Technical Field
The invention relates to the field of multi-target tracking, in particular to a multi-target tracking method based on point-surface matching association.
Background
Visual object tracking is a widely used computer vision task. Unlike object classification and object detection, object tracking can be divided into two tasks according to the scenario: 1) single-object tracking (VOT); 2) multi-object tracking (MOT). Single-object tracking continuously marks, throughout the subsequent frames of a video sequence, an arbitrary target given in the first frame; multi-object tracking detects and tracks objects of known categories that may appear or disappear in a video sequence, and must assign an ID to each tracked object.
Early tracking schemes often tracked points of interest in space-time. Such trackers were comparatively simple, and were fast and stable in simple scenarios with slow linear motion and few targets. However, when encountering low-level cues such as image edge regions, the tracking effect degrades severely. With the development of computer vision, many detection-based object tracking algorithms have become popular. These methods mainly rely on a target detection model to detect targets and must perform data association between the targets of consecutive frames to connect the motion tracks across frames; this detect-then-track two-stage strategy is currently the mainstream multi-target tracking paradigm. Detection-based tracking mostly exploits the high performance of deep-learning-based target detectors, but the data association algorithms of the trackers remain slow and complex.
Disclosure of Invention
The invention provides a multi-target tracking method based on point-plane matching association, which aims to overcome the technical problems.
A multi-target tracking method based on point-plane matching association comprises the following steps,
step one, obtaining a video to be detected, carrying out serialization processing on the video to obtain a video frame sequence,
step two, constructing a detector according to the characteristic extraction network, sequentially carrying out target detection on video frames in the video frame sequence according to the detector and marking all detected targets,
step three, let t=1, initialize the target track according to the detected target in the first video frame, initialize the buffer pool, the buffer pool is used for storing the target track set,
step four, acquiring the t-th and (t+1)-th video frames from the video frame sequence and dividing the targets in the (t+1)-th video frame into high-confidence targets and low-confidence targets; carrying out a first association matching on the high-confidence targets according to a target-box similarity perception algorithm, adding the successfully matched targets into the target tracks of the t-th video frame, initializing target tracks for the targets that fail to match, and storing these tracks in the buffer pool; carrying out a second association matching on the low-confidence targets, adding the successfully matched targets into the target tracks of the t-th video frame and discarding the targets that fail to match; wherein the second association matching comprises: acquiring the low-confidence targets and taking them as the targets to be matched; modeling the targets to be matched and the targets in the t-th video frame with two-dimensional Gaussian distributions; acquiring the two-dimensional Gaussian distribution of each target after modeling; constructing a distance matrix according to the two-dimensional Gaussian distributions of all targets in the t-th and (t+1)-th video frames; normalizing the distance matrix to obtain the similarity matrix of the (t+1)-th video frame; carrying out similarity matching on each low-confidence target in the (t+1)-th video frame according to the similarity matrix; when a low-confidence target in the (t+1)-th video frame satisfies the similarity matching condition, adding it into the corresponding target track of the t-th video frame and storing the updated track in the buffer pool; and when it does not satisfy the matching condition, discarding it,
and step five, judging whether the t-th video frame is the last video frame in the video frame sequence; if so, acquiring all target tracks, otherwise, letting t = t+1 and re-executing step four.
Preferably, the constructing the detector according to the feature extraction network includes modifying the feature extraction network, the modifying includes using a deep aggregation network as a backbone network of the feature extraction network, the feature extraction network uses a deformable convolution network to perform feature extraction, uses a deep aggregation upsampling module to perform feature fusion, and adds a global context information extraction module to the feature extraction network, and uses the modified feature extraction network as the detector.
Preferably, said modeling of the target to be matched and the target in the t-th video frame by two-dimensional Gaussian distribution comprises modeling according to formulas (1) and (2):

$$f(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})=\frac{1}{2\pi\lvert\boldsymbol{\Sigma}\rvert^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\quad(1)$$

$$\boldsymbol{\mu}=[a,b]^{\mathrm{T}},\qquad\boldsymbol{\Sigma}=\operatorname{diag}\!\left(\frac{w^{2}}{4},\frac{h^{2}}{4}\right)\quad(2)$$

where (a, b) is the center coordinate of the target; w and h are the extents of the target box along the x-axis and y-axis (half-axes w/2 and h/2); and x, μ, Σ are, respectively, the coordinate vector (x, y), the mean vector, and the covariance matrix of the Gaussian distribution. The target thus follows a two-dimensional Gaussian distribution N(μ, Σ), denoted B ~ N(μ, Σ).
Preferably, the similarity matching of the low-confidence targets in the (t+1)-th video frame according to the similarity matrix comprises setting a similarity threshold and obtaining a similarity matrix S_nm, where n is the number of low-confidence targets in the (t+1)-th video frame and m is the number of target tracks of the t-th video frame; when S_ij does not satisfy the similarity threshold, the i-th target is unrelated to the j-th target track and S_ij is set to 0, and otherwise S_ij is not modified, where i denotes the i-th target in the (t+1)-th video frame, j denotes the j-th target track of the t-th video frame, i ≤ n, and j ≤ m; the updated S_nm is input to a greedy algorithm to obtain the target track j most related to the i-th target.
Preferably, the dividing of the targets in the (t+1)-th video frame into high-confidence targets and low-confidence targets comprises setting a tracking threshold, obtaining all targets in the (t+1)-th video frame, and calculating the confidence according to the pixel values of the targets; when the confidence of a target exceeds the tracking threshold, the target is taken as a high-confidence target, otherwise as a low-confidence target.
The invention provides a multi-target tracking method based on point-plane matching association, which predicts the track of a target by measuring the normalized Gaussian similarity distance between targets and target tracks in two adjacent video frames, improving the accuracy of multi-target track association. A detector is constructed that extracts global context information from the low-level features ahead of the feature extraction network, enhancing the expressiveness of key texture information in the low-level features and improving the detection confidence of occluded targets. A secondary association matching and caching strategy for multi-target tracks is realized: occluded or unmatched targets undergo a second matching based on the PS association algorithm, and a caching mechanism is applied to interrupted tracks, forming long-term, stable multi-target track association.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it will be obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a one-stage point-to-face matching association of the present invention;
FIG. 3 is a flow chart of an embodiment of the present invention;
FIG. 4 is a graph of multi-objective tracking results in an embodiment of the invention;
FIG. 5(a) is a graph analyzing the MOTA and IDF1 metrics under different values of the NWS similarity constant C;
FIG. 5(b) is a graph analyzing the IDs metric under different values of the NWS similarity constant C;
FIG. 6(a) is a comparative test on the MOTA metric under different values of the buffer frame number N;
FIG. 6(b) is a comparative test on the IDF1 metric under different values of the buffer frame number N.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
FIG. 1 is a flowchart of the method of the present invention. As shown in FIG. 1, the method of this embodiment may include:
step one, obtaining a video to be detected, carrying out serialization processing on the video to obtain a video frame sequence,
step two, constructing a detector according to the characteristic extraction network, sequentially carrying out target detection on video frames in the video frame sequence according to the detector and marking all detected targets,
step three, let t=1, initialize the target track according to the detected target in the first video frame, initialize the buffer pool, the buffer pool is used for storing the target track set,
step four, acquiring the t-th and (t+1)-th video frames from the video frame sequence and dividing the targets in the (t+1)-th video frame into high-confidence targets and low-confidence targets; carrying out a first association matching on the high-confidence targets according to a target-box similarity perception algorithm, adding the successfully matched targets into the target tracks of the t-th video frame, initializing target tracks for the targets that fail to match, and storing these tracks in the buffer pool; carrying out a second association matching on the low-confidence targets, adding the successfully matched targets into the target tracks of the t-th video frame and discarding the targets that fail to match; wherein the second association matching comprises: acquiring the low-confidence targets and taking them as the targets to be matched; modeling the targets to be matched and the targets in the t-th video frame with two-dimensional Gaussian distributions; acquiring the two-dimensional Gaussian distribution of each target after modeling; constructing a distance matrix according to the two-dimensional Gaussian distributions of all targets in the t-th and (t+1)-th video frames; normalizing the distance matrix to obtain the similarity matrix of the (t+1)-th video frame; carrying out similarity matching on each low-confidence target in the (t+1)-th video frame according to the similarity matrix; when a low-confidence target in the (t+1)-th video frame satisfies the similarity matching condition, adding it into the corresponding target track of the t-th video frame and storing the updated track in the buffer pool; and when it does not satisfy the matching condition, discarding it,
and step five, judging whether the t-th video frame is the last video frame in the video frame sequence; if so, acquiring all target tracks, otherwise, letting t = t+1 and re-executing step four.
Based on this scheme, the track of a target is predicted by measuring the normalized Gaussian Wasserstein similarity between targets and target tracks in two adjacent video frames, improving the accuracy of multi-target track association. A detector is constructed that extracts global context information from the low-level features ahead of the feature extraction network, enhancing the expressiveness of key texture information in the low-level features and improving the detection confidence of occluded targets. A secondary association matching and caching strategy for multi-target tracks is realized: occluded or unmatched targets undergo a second matching based on the PS association algorithm, and a caching mechanism is applied to interrupted tracks, forming long-term, stable multi-target track association.
Step one, obtaining a video to be detected, carrying out serialization processing on the video to obtain a video frame sequence,
step two, a detector is constructed from a feature extraction network. Global attention information of the low-level features is extracted before the feature extraction network, and the features of interest are optimized using rich texture and channel information. Constructing the detector according to the feature extraction network comprises modifying the network: a deep aggregation network is used as the backbone of the feature extraction network, the feature extraction network adopts a deformable convolution network for feature extraction and a deep aggregation up-sampling module for feature fusion, and a global context information extraction module (GCE) is added to the feature extraction network; the modified feature extraction network is used as the detector.
the feature extraction network adopts a deep aggregation network DLA as a backbone network to extract multi-level features. The DLA backbone network adds more jump connections similar to the feature pyramid network FPN between the low-level features and the high-level features, adopts an iterative deep aggregation up-sampling module IDAUp in the fusion part, and replaces the common convolution network with a deformable convolution network DCN capable of dynamically adjusting the receptive field according to the target size to fuse the multi-scale features extracted by the DLA.
Input: the feature extraction network accepts an input image of size 3 × H × W; the total downsampling rate of the DLA backbone is 32.
Output: after the IDAUp fusion and up-sampling network, a 4× downsampled feature map of size 64 × H/4 × W/4 is output.
The attention mechanism adopted by the GCE comprises a spatial attention part and a channel attention part. Attending to the points of interest through both parts simultaneously allows the feature extraction network to focus on the more important features of the input feature map F, helping the underlying network layers better process the spatial and channel information of the input data and extract the relevant features.
The formula for the GCE is given in equation (1):

$$A_s=\sigma\!\left(W\ast[\mathrm{MP}(F);\mathrm{AP}(F)]\right),\qquad A_c=\sigma\!\left(\mathrm{MLP}(\mathrm{GMP}(F))+\mathrm{MLP}(\mathrm{GAP}(F))\right),\qquad F'=A_c\otimes A_s\otimes F\quad(1)$$

where MP and AP are the max-pooling and average-pooling layers, W is the convolution layer weight, σ is the Sigmoid activation function, GMP and GAP are the global max-pooling and global average-pooling layers, MLP is a multi-layer perceptron, ⊗ denotes the element-wise product, A_s denotes spatial attention, and A_c denotes channel attention.
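For illustration only, the following PyTorch module is a minimal sketch of a GCE-style block combining the channel and spatial attention terms of equation (1). The module structure, the channel reduction ratio, the 7×7 spatial-attention kernel, and the order of applying A_c before A_s are assumptions not fixed by the patent.

```python
import torch
import torch.nn as nn

class GCE(nn.Module):
    """Sketch of a global context extraction block: channel attention A_c
    followed by spatial attention A_s, per equation (1) (structure assumed)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP for channel attention over global max/avg pooled descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # 7x7 convolution for spatial attention over channel-wise max/avg maps
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        # A_c = sigmoid(MLP(GMP(F)) + MLP(GAP(F)))
        gmp = torch.amax(f, dim=(2, 3))            # B x C
        gap = torch.mean(f, dim=(2, 3))            # B x C
        a_c = self.sigmoid(self.mlp(gmp) + self.mlp(gap)).view(b, c, 1, 1)
        f = f * a_c                                # element-wise product
        # A_s = sigmoid(W * [MP(F); AP(F)]) along the channel axis
        mp = torch.amax(f, dim=1, keepdim=True)    # B x 1 x H x W
        ap = torch.mean(f, dim=1, keepdim=True)    # B x 1 x H x W
        a_s = self.sigmoid(self.conv(torch.cat([mp, ap], dim=1)))
        return f * a_s
```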
Target detection is carried out sequentially on the video frames of the sequence with the detector, and all detected targets are marked. For the t-th frame of the video, the detector outputs target box information size = (a, b, w, h) for every target, where (a, b) is the target center coordinate and (w, h) is the width and height of the target box;
step three, let t = 1 and initialize a target track for each target detected in the first video frame. An initialized target track is a set that stores, under the same target number, the target's position information in the video frames and the associated frame information; specifically, it comprises a target number, the target center coordinate, the target width and height, a cache duration, and a frame sequence number set. The cache duration is how long the target has appeared in the video, and the frame sequence number set contains the sequence numbers of all video frames in which the target appears. A buffer pool is also initialized; the buffer pool stores the set of target tracks (an illustrative record structure is sketched below),
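As an illustration, a target track as described above can be held in a small record structure. The following is a minimal sketch under stated assumptions: the field names (track_id, center, size, buffer_age, frame_ids, active) are hypothetical, not terms fixed by the patent.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Track:
    """One target track as described in step three (field names assumed)."""
    track_id: int                      # target number (ID)
    center: Tuple[float, float]        # target center coordinate (a, b)
    size: Tuple[float, float]          # target box width and height (w, h)
    buffer_age: int = 0                # cache duration: frames since last match
    frame_ids: List[int] = field(default_factory=list)  # frames where it appears
    active: bool = True                # activity state
```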
step four, acquiring the t-th and (t+1)-th video frames from the video frame sequence and dividing the targets in the (t+1)-th video frame into high-confidence targets and low-confidence targets. The division comprises setting a tracking threshold, acquiring all targets in the (t+1)-th video frame, and calculating each target's confidence from its pixel values; when a target's confidence exceeds the tracking threshold it is taken as a high-confidence target, otherwise as a low-confidence target,
performing first-time association matching on the high-confidence targets according to a target frame similarity sensing algorithm, adding the successfully matched targets into target tracks of the t-th video frame, respectively initializing target tracks of the targets which are failed to be matched, and storing the target tracks in a cache pool, wherein the performing first-time association matching according to the target frame similarity sensing algorithm comprises calculating normalized Gaussian Wasserstein similarity between all the high-confidence targets and all the target track frames of the previous frame, performing one-to-one matching according to the maximum value of similarity scores between the high-confidence targets, and assigning the same target numbers for the matched high-confidence targets, wherein the high-confidence targets which are not matched are assigned with new target numbers.
A second association matching is carried out on the low-confidence targets; the successfully matched targets are added into the target tracks of the t-th video frame, and the targets that fail to match are discarded. The second association matching comprises acquiring the low-confidence targets, taking them as the targets to be matched, and modeling the targets to be matched and the targets in the t-th video frame with two-dimensional Gaussian distributions.

Modeling the target to be matched and the target in the t-th video frame with two-dimensional Gaussian distributions comprises modeling according to formulas (2) and (3):

$$f(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})=\frac{1}{2\pi\lvert\boldsymbol{\Sigma}\rvert^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\quad(2)$$

$$\boldsymbol{\mu}=[a,b]^{\mathrm{T}},\qquad\boldsymbol{\Sigma}=\operatorname{diag}\!\left(\frac{w^{2}}{4},\frac{h^{2}}{4}\right)\quad(3)$$

where (a, b) is the center coordinate of the target; w and h are the extents of the target box along the x-axis and y-axis (half-axes w/2 and h/2); and x, μ, Σ are, respectively, the coordinate vector (x, y), the mean vector, and the covariance matrix of the Gaussian distribution. The target thus follows a two-dimensional Gaussian distribution N(μ, Σ), denoted B ~ N(μ, Σ). Remodeling each detected target with a two-dimensional Gaussian distribution yields the probability density of each target; the distribution weight (i.e., the two-dimensional Gaussian probability) of the foreground pixels decreases gradually from the target center toward the edge of the target box.
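As an illustrative sketch of this modeling step, under the assumption that target boxes are given as (a, b, w, h) tuples (the function name is hypothetical):

```python
import numpy as np

def box_to_gaussian(a: float, b: float, w: float, h: float):
    """Model a target box (a, b, w, h) as a 2-D Gaussian N(mu, Sigma)
    per formulas (2)-(3): mu = [a, b], Sigma = diag(w^2/4, h^2/4)."""
    mu = np.array([a, b], dtype=float)
    sigma = np.diag([w ** 2 / 4.0, h ** 2 / 4.0])
    return mu, sigma
```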
After modeling, the two-dimensional Gaussian distribution of each target is acquired, and a distance matrix is constructed from the two-dimensional Gaussian distributions of all targets in the t-th and (t+1)-th video frames.
Constructing the distance matrix specifically comprises the following.
the two-dimensional Gaussian distribution of the current frame target 1 is N 111 ) The two-dimensional gaussian distribution of the previous frame object 2 is N 222 ) The Wasserstein distance between the two is described as equation (4).
Simplification can yield equation (5):
wherein I II 2 Representing 2-normal form, I p Representing the p-norm. According to the above formula, bring into the target frame B 1 =(a 1 ,b 1 ,w 1 ,h 1 ),μ 1 =[a 1 ,b 1 ],Σ 1 =diag[w 1 2 /4,h 1 2 /4]Target frame B 2 =(a 2 ,b 2 ,w 2 ,h 2 ),μ 2 =[a 2 ,b 2 ],Σ 2 =diag[w 2 2 /4,h 2 2 /4]Obtaining a formula (6):
where W represents the Wasserstein distance matrix. The adjacent video W matrix is as in equation (7):
where m represents the number of target tracks of the previous frame (including all detected targets and non-matched targets, i.e. all tracks in the buffer pool), n represents the number of targets of the current frame, and d represents the wasperstein distance between a certain target track of the previous frame and the detected target of the current frame.
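A minimal NumPy sketch of equations (6)-(7), again assuming boxes are (a, b, w, h) tuples (function and variable names are illustrative, not from the patent):

```python
import numpy as np

def wasserstein_sq(box1, box2):
    """Squared 2-Wasserstein distance between the Gaussians of two target
    boxes, in the closed form of equation (6); boxes are (a, b, w, h)."""
    a1, b1, w1, h1 = box1
    a2, b2, w2, h2 = box2
    p1 = np.array([a1, b1, w1 / 2.0, h1 / 2.0])
    p2 = np.array([a2, b2, w2 / 2.0, h2 / 2.0])
    return float(np.sum((p1 - p2) ** 2))

def distance_matrix(detections, tracks):
    """W matrix of equation (7): n rows for current-frame detections,
    m columns for previous-frame tracks (including buffered tracks)."""
    return np.array([[wasserstein_sq(d, t) for t in tracks] for d in detections])
```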
The distance matrix is normalized to obtain a similarity matrix of the (t+1) th video frame,
the construction of the normalized similarity matrix specifically comprises the construction of a measurement method for evaluating the similarity between targets based on the Wasserstein distance of Gaussian distribution, which is called a point-plane matching association algorithm and is marked as a PS algorithm,
because the value range is not in the range of [0,1], W cannot be directly used as a similarity measurement matrix for measuring the two-dimensional Gaussian distribution of the target frame. However, the normalized gaussian similarity matrix, abbreviated as normalized waserstein similarity (Normalized Wasserstein Similarity, NWS), may be obtained by normalizing W in the form of a natural exponent, as in equation (8).
Where C is a constant, which is primarily related to the target size of the dataset, a larger value should be given to the constant C as the pixel area of the target in the image increases. NWS matrix as in equation (9):
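Continuing the sketch above, equations (8)-(9) amount to one element-wise operation; the default value of C below is a placeholder, since the patent only states that C grows with the pixel area of the targets:

```python
import numpy as np

def nws_matrix(w_matrix, c: float = 12.0):
    """NWS matrix of equations (8)-(9): exp(-sqrt(W)/C) applied element-wise.
    The default C is an assumption; it is dataset-dependent."""
    return np.exp(-np.sqrt(w_matrix) / c)
```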
respectively carrying out similarity matching on low confidence coefficient targets in the (t+1) th video frame according to the similarity matrixAs shown in fig. 2, the performing similarity matching on the low confidence targets in the (t+1) th video frame according to the similarity matrix respectively includes setting a similarity threshold to obtain a similarity matrix S nm N represents the number of targets in the (t+1) th video frame, m represents the number of target tracks of the (t) th video frame, when S ij When the similarity threshold is not satisfied, the ith target is not related to the jth item target track, and S is ij Set to 0, otherwise, not modify S ij I represents the ith target in the (t+1) th video frame, j represents the jth item mark track in the (t) th video frame, i is less than or equal to n, j is less than or equal to m, and S is updated nm Inputting the target tracks into a greedy algorithm to obtain a j-th target track most relevant to an i-th target, specifically, measuring the similarity between target frames by using a PS algorithm based on normalized Wasserstein similarity in the data association process, further converting the W matrix into an NWS matrix capable of carrying out similarity measurement by calculating a W distance matrix of all target tracks and all detected targets of the current frame, and giving a similarity threshold sigma NWS Removing less than sigma NWS And (3) finding the basis of the track association of all the matching results with the highest similarity in the NWS matrix through a greedy algorithm.
When a low-confidence target in the (t+1)-th video frame satisfies the similarity matching condition, it is added into the corresponding target track of the t-th video frame and the updated track is stored in the buffer pool; when it does not satisfy the matching condition, the low-confidence target is discarded,
and step five, judging whether the t-th video frame is the last video frame in the video frame sequence; if so, acquiring all target tracks, otherwise, letting t = t+1 and re-executing step four.
In this embodiment, all detected targets are divided into high-confidence targets and low-confidence targets according to a detection threshold; the high-score targets are matched against the tracks in the buffer pool, and a second association matching is carried out between the unmatched target tracks and the low-score targets; the tracks that remain unmatched at the end are retained in the buffer pool, as shown in FIG. 3. Specifically:
S51, two-dimensional Gaussian distribution similarity measurement of target boxes
The detection module of the detector outputs a Sigmoid-normalized heat map in which each pixel value represents a target confidence. Targets whose confidence is greater than the detection threshold θ_det are preserved; all other confidences, below θ_det, are set to 0.
When associating target tracks, a tracking threshold θ_track ∈ (θ_det, 1) is set, and the center positions and bounding-box information of the high-confidence targets whose confidence lies in the range [θ_track, 1) are modeled as two-dimensional Gaussian distributions.
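A minimal sketch of this two-threshold split; the detection format (a dict with a "score" key) and the threshold values are assumptions, since the patent only requires θ_track ∈ (θ_det, 1):

```python
def partition_detections(detections, theta_det: float = 0.3, theta_track: float = 0.6):
    """Partition raw detections by confidence: scores below theta_det are
    discarded (set to 0 in the heat map), scores in [theta_det, theta_track)
    form the low-confidence set, and scores in [theta_track, 1) form the
    high-confidence set."""
    high = [d for d in detections if d["score"] >= theta_track]
    low = [d for d in detections if theta_det <= d["score"] < theta_track]
    return high, low
```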
S52, correlation matching of high confidence targets
For the first frame, the track is initialized for all detected targets.
For subsequent frames, the target-box similarity perception algorithm is used to calculate the normalized Gaussian Wasserstein similarity between all prediction boxes and the track boxes of the previous frame. One-to-one matching is carried out according to the maximum similarity score: matched prediction boxes are assigned the same ID, unmatched prediction boxes are given new IDs and generate new tracks, and unmatched tracks are retained.
S53, second matching of unmatched tracks
The unmatched tracks are obtained and matched, using the PS algorithm, against all prediction boxes whose confidence lies in the interval [θ_det, θ_track). Prediction boxes that remain unmatched are then cleared, and the remaining unmatched historical tracks are retained.
S54, track-state caching mechanism
All track states (including the target center coordinates, size information, cache duration, and activity state) are updated and added to the buffer pool. The cache duration of matched tracks is reset and they are marked active; the cache duration of unmatched tracks is increased and they are marked inactive.
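A sketch of the S54 caching mechanism, reusing the Track fields assumed in the step-three sketch above (max_age plays the role of the buffer length N, set to 30 frames in the experiments below):

```python
def update_buffer_pool(buffer_pool, matched_tracks, unmatched_tracks, max_age: int = 30):
    """Matched tracks reset their cache duration and are marked active;
    unmatched tracks age and are marked inactive."""
    for trk in matched_tracks:
        trk.buffer_age = 0
        trk.active = True
    for trk in unmatched_tracks:
        trk.buffer_age += 1
        trk.active = False
    # evict tracks that have stayed unmatched longer than the buffer length
    return [t for t in buffer_pool if t.buffer_age <= max_age]
```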
The present invention was validated on the MOT17 dataset provided by MOT Challenge. The specific procedure is as follows.
A. Data source: the multi-target tracking dataset is the MOT17 dataset provided by MOT Challenge. It contains 7 training sequences and 7 test sequences; each sequence provides detections from 3 detectors (DPM, Faster R-CNN, and SDP), giving 21 training sequences and 21 test sequences in total. The video frame rate is 25-30 frames per second.
Table 1 shows the MOT17 dataset video sequence names and corresponding image sizes and sequence lengths:
TABLE 1
B. For this dataset, the first half of each training sequence is taken as the training set and the second half is taken as the validation set.
C. This section mainly performs ablation experiments on the validation set of the MOT17 dataset for the global context information extraction module GCE and for the PS association algorithm with the secondary association matching strategy and caching mechanism, as shown in Table 2. The target detection process is shown in FIG. 4. MOT17_half denotes pre-training on the CrowdHuman dataset, training the model on the MOT17 half training set, and finally evaluating on the MOT17 validation set; Only CrowdH denotes evaluating on the MOT17 validation set using only model weights trained on the CrowdHuman data; Scratch denotes training only on the MOT17 half training set and testing on the MOT17 validation set.
Experiments show that the proposed GCE module and PS association algorithm significantly improve the one-stage CenterTrack target tracking model. The Multiple Object Tracking Accuracy (MOTA) increased by 1.6% and 3.2% in the MOT17_half and Only CrowdH experiments respectively, and the Identified Detections F1 score (IDF1) increased by 6.7% and 5.2% respectively. The curves of the influence of C on the different metrics are shown in FIGS. 5(a) and 5(b), and the comparison of CenterTrack and PSTrack on the MOTA metric under different N is shown in FIGS. 6(a) and 6(b). In the Scratch experiment, because the feature extraction network of CenterTrack carries ImageNet pre-trained weights, those weights cannot be fully loaded once the GCE module is added; the experimental result with the GCE module is therefore not as good as CenterTrack's, but the PS association algorithm still shows strong performance. Overall, the proposed model outperforms CenterTrack. Furthermore, for a fairer comparison, a comparative test of the tracker was conducted under the FRCNN public detector provided by the MOT17 dataset, as shown in Table 3. The experiments show that the PS association algorithm performs better than the CenterTrack center-point matching algorithm.
TABLE 2
TABLE 3
D. The experiment sets the track buffer length N to 30 frames. The evaluation results on the MOT17 test set are shown in Table 4: using only the PS association algorithm improves the MOTA metric by 0.6% and the HOTA metric by 0.4%, with good performance also on metrics such as FN, ID Sw., and Recall. For PSTrack with the GCE module, the final performance is lower than PSTrack without it, because the ImageNet pre-trained weights cannot be fully loaded during training; the final tracking performance is nevertheless still stronger than CenterTrack.
TABLE 4
The overall beneficial effects are as follows:
The invention provides a multi-target tracking method based on point-plane matching association, which predicts the track of a target by measuring the normalized Gaussian Wasserstein similarity between targets and target tracks in two adjacent video frames, improving the accuracy of multi-target track association. A detector is constructed that extracts global context information from the low-level features ahead of the feature extraction network, enhancing the expressiveness of key texture information in the low-level features and improving the detection confidence of occluded targets. A secondary association matching and caching strategy for multi-target tracks is realized: occluded or unmatched targets undergo a second matching based on the PS association algorithm, and a caching mechanism is applied to interrupted tracks, forming long-term, stable multi-target track association.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (5)

1. A multi-target tracking method based on point-plane matching association is characterized by comprising the following steps of,
step one, obtaining a video to be detected, carrying out serialization processing on the video to obtain a video frame sequence,
step two, constructing a detector according to the characteristic extraction network, sequentially carrying out target detection on video frames in the video frame sequence according to the detector and marking all detected targets,
step three, let t=1, initialize the target track according to the detected target in the first video frame, initialize the buffer pool, the buffer pool is used for storing the target track set,
step four, acquiring the t-th and (t+1)-th video frames from the video frame sequence and dividing the targets in the (t+1)-th video frame into high-confidence targets and low-confidence targets; carrying out a first association matching on the high-confidence targets according to a target-box similarity perception algorithm, adding the successfully matched targets into the target tracks of the t-th video frame, initializing target tracks for the targets that fail to match, and storing these tracks in the buffer pool; carrying out a second association matching on the low-confidence targets, adding the successfully matched targets into the target tracks of the t-th video frame and discarding the targets that fail to match; wherein the second association matching comprises: acquiring the low-confidence targets and taking them as the targets to be matched; modeling the targets to be matched and the targets in the t-th video frame with two-dimensional Gaussian distributions; acquiring the two-dimensional Gaussian distribution of each target after modeling; constructing a distance matrix according to the two-dimensional Gaussian distributions of all targets in the t-th and (t+1)-th video frames; normalizing the distance matrix to obtain the similarity matrix of the (t+1)-th video frame; carrying out similarity matching on each low-confidence target in the (t+1)-th video frame according to the similarity matrix; when a low-confidence target in the (t+1)-th video frame satisfies the similarity matching condition, adding it into the corresponding target track of the t-th video frame and storing the updated track in the buffer pool; and when it does not satisfy the matching condition, discarding it,
and step five, judging whether the t-th video frame is the last video frame in the video frame sequence; if so, acquiring all target tracks, otherwise, letting t = t+1 and re-executing step four.
2. The multi-target tracking method based on point-plane matching association according to claim 1, wherein constructing the detector according to the feature extraction network comprises modifying the feature extraction network, the modification comprising using a deep aggregation network as the backbone of the feature extraction network, the feature extraction network using a deformable convolution network for feature extraction and a deep aggregation up-sampling module for feature fusion, and adding a global context information extraction module to the feature extraction network; the modified feature extraction network is used as the detector.
3. The method of claim 1, wherein modeling the target to be matched and the target in the t-th video frame by two-dimensional Gaussian distribution comprises modeling according to formulas (1) and (2):

$$f(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})=\frac{1}{2\pi\lvert\boldsymbol{\Sigma}\rvert^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\quad(1)$$

$$\boldsymbol{\mu}=[a,b]^{\mathrm{T}},\qquad\boldsymbol{\Sigma}=\operatorname{diag}\!\left(\frac{w^{2}}{4},\frac{h^{2}}{4}\right)\quad(2)$$

where (a, b) is the center coordinate of the target; w and h are the extents of the target box along the x-axis and y-axis (half-axes w/2 and h/2); and x, μ, Σ are, respectively, the coordinate vector (x, y), the mean vector, and the covariance matrix of the Gaussian distribution. The target thus follows a two-dimensional Gaussian distribution N(μ, Σ), denoted B ~ N(μ, Σ).
4. The multi-target tracking method based on point-plane matching association according to claim 1, wherein performing similarity matching on the low-confidence targets in the (t+1)-th video frame according to the similarity matrix comprises setting a similarity threshold and obtaining a similarity matrix S_nm, where n is the number of low-confidence targets in the (t+1)-th video frame and m is the number of target tracks of the t-th video frame; when S_ij does not satisfy the similarity threshold, the i-th target is unrelated to the j-th target track and S_ij is set to 0, and otherwise S_ij is not modified, where i denotes the i-th target in the (t+1)-th video frame, j denotes the j-th target track of the t-th video frame, i ≤ n, and j ≤ m; and the updated S_nm is input to a greedy algorithm to obtain the target track j most related to the i-th target.
5. The method of claim 1, wherein dividing the targets in the (t+1)-th video frame into high-confidence targets and low-confidence targets comprises setting a tracking threshold, acquiring all targets in the (t+1)-th video frame, and calculating the confidence according to the pixel values of the targets; when the confidence of a target exceeds the tracking threshold, the target is taken as a high-confidence target, and otherwise as a low-confidence target.
CN202310600166.6A 2023-05-25 2023-05-25 Multi-target tracking method based on point-surface matching association Pending CN116563345A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310600166.6A CN116563345A (en) 2023-05-25 2023-05-25 Multi-target tracking method based on point-surface matching association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310600166.6A CN116563345A (en) 2023-05-25 2023-05-25 Multi-target tracking method based on point-surface matching association

Publications (1)

Publication Number Publication Date
CN116563345A true CN116563345A (en) 2023-08-08

Family

ID=87487868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310600166.6A Pending CN116563345A (en) 2023-05-25 2023-05-25 Multi-target tracking method based on point-surface matching association

Country Status (1)

Country Link
CN (1) CN116563345A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173221A (en) * 2023-09-19 2023-12-05 浙江大学 Multi-target tracking method based on authenticity grading and occlusion recovery
CN117173221B (en) * 2023-09-19 2024-04-19 浙江大学 Multi-target tracking method based on authenticity grading and occlusion recovery

Similar Documents

Publication Publication Date Title
Hassaballah et al. Vehicle detection and tracking in adverse weather using a deep learning framework
CN111401201B (en) Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN111696128B (en) High-speed multi-target detection tracking and target image optimization method and storage medium
CN108447080B (en) Target tracking method, system and storage medium based on hierarchical data association and convolutional neural network
Lei et al. Region-enhanced convolutional neural network for object detection in remote sensing images
CN108520203B (en) Multi-target feature extraction method based on fusion of self-adaptive multi-peripheral frame and cross pooling feature
CN110942471A (en) Long-term target tracking method based on space-time constraint
CN111523553A (en) Central point network multi-target detection method based on similarity matrix
CN112738470B (en) Method for detecting parking in highway tunnel
CN112052802A (en) Front vehicle behavior identification method based on machine vision
CN104778699B (en) A kind of tracking of self adaptation characteristics of objects
CN114898403A (en) Pedestrian multi-target tracking method based on Attention-JDE network
CN116563345A (en) Multi-target tracking method based on point-surface matching association
Wang et al. MLFFNet: Multilevel feature fusion network for object detection in sonar images
CN113963333B (en) Traffic sign board detection method based on improved YOLOF model
Wu et al. Single shot multibox detector for vehicles and pedestrians detection and classification
CN112800967B (en) Posture-driven shielded pedestrian re-recognition method
CN116883457B (en) Light multi-target tracking method based on detection tracking joint network and mixed density network
Walia et al. A novel approach of multi-stage tracking for precise localization of target in video sequences
Sharma et al. HistoNet: Predicting size histograms of object instances
CN110147768B (en) Target tracking method and device
CN114627339B (en) Intelligent recognition tracking method and storage medium for cross border personnel in dense jungle area
CN113642520B (en) Double-task pedestrian detection method with head information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination