CN114972417B - Multi-target tracking method for dynamic track quality quantification and feature re-planning


Info

Publication number
CN114972417B
CN114972417B
Authority
CN
China
Prior art keywords
track
current frame
frame
detection
state
Prior art date
Legal status
Active
Application number
CN202210343596.XA
Other languages
Chinese (zh)
Other versions
CN114972417A (en)
Inventor
孔军 (Kong Jun)
张元澍 (Zhang Yuanshu)
蒋敏 (Jiang Min)
Current Assignee
Jiangnan University
Original Assignee
Jiangnan University
Priority date
Filing date
Publication date
Application filed by Jiangnan University
Priority to CN202210343596.XA
Publication of CN114972417A
Application granted
Publication of CN114972417B
Legal status: Active


Classifications

    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments (G: Physics; G06: Computing; G06T: Image data processing or generation)
    • G06T 2207/10016: Video; image sequence
    • G06T 2207/30196: Human being; person
    • G06T 2207/30241: Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-target tracking method based on a dynamic track quantization strategy and feature re-planning. The method adopts a multi-target tracking framework built on the joint-detection-and-tracking paradigm. Most existing algorithms initialize every unmatched detection frame as a new track, terminate any track that stays unmatched beyond a threshold number of frames, and thereby ignore the differences between tracks of different quality when handling track birth and termination. The invention therefore provides a dynamic track quality quantization strategy that explicitly characterizes the quality of each track through a dynamically updated score and applies different update mechanisms according to the matching result. In addition, aiming at the conflict between the detection and tracking subtasks in joint-detection-and-tracking models, the invention designs a channel-enhanced feature re-planning module that drives the two subtasks to learn distinct features, improves the suitability of the features, and provides more accurate detection results for the dynamic track quantization strategy.

Description

Multi-target tracking method for dynamic track quality quantification and feature re-planning
Technical Field
The present invention relates to the field of machine vision, and in particular, to a multi-target tracking method, apparatus, and computer storage medium.
Background
As machine vision is studied extensively in both theory and practice, multi-target tracking has become one of its important branches. Owing to the diversity of real-world environments and the complexity of target behavior, many problems in multi-target tracking remain to be solved. Current methods fall mainly into two paradigms: detection-before-tracking and joint detection and tracking.
Before the joint-detection-and-tracking paradigm emerged, most multi-target tracking studies followed the detection-before-tracking paradigm. These algorithms divide the multi-target tracking task into two separate subtasks, detection and tracking, with mutually independent models. Although the results are respectable, the computational cost is high, the subtasks cannot be optimized jointly, and a balance between accuracy and speed is difficult to obtain. By contrast, the joint-detection-and-tracking paradigm fuses the two subtasks into a unified network, reducing the algorithmic complexity of staged processing while increasing the coupling between functional modules, and achieves higher accuracy and a better balance. Within this paradigm, however, the data-association module receives little attention: mainstream algorithms pursue more accurate detection results and more discriminative appearance embeddings while ignoring the diversity of tracks. Tracks of different quality are processed in the same way, and a track is discarded once its number of lost frames reaches a threshold regardless of its quality, so tracks of too low quality interfere with the matching stage, tracks of high quality cannot take part in more rounds of matching, and the accuracy of multi-target tracking suffers.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is that prior-art methods ignore track diversity and discard tracks of any quality once the lost time reaches a threshold, which degrades tracking accuracy.
In order to solve the technical problems, the invention provides a multi-target tracking method, which comprises the following steps:
the method comprises the steps of obtaining an original frame image of a video, inputting the original frame image into a backbone network, and outputting pedestrian characteristics;
calculating according to the pedestrian characteristics to obtain a current frame detection frame and a corresponding current frame appearance embedded vector;
judging whether the current frame image is a first frame or not, if the current frame image is not the first frame, matching and associating a current frame detection frame and the corresponding current frame appearance embedded vector with a previous frame track and an appearance embedded vector updated along with the previous frame track;
if the matching is successful, calculating the track quantization score of the current frame according to the confidence coefficient of the detection frame of the current frame and the track quantization score of the previous frame, embedding the appearance of the current frame for updating, judging the track state, marking the track of the activation state which is successfully matched as a tracking state, and marking the track state which is temporarily lost and successfully matched as an activation state;
if the matching fails, subtracting a preset constant from the previous frame track quantization score to obtain the current frame track quantization score, wherein the appearance embedding is not updated, marking the current frame detection frame with failed matching as a new inactive state track, resetting the current frame track state according to the current frame track quantization score and a preset threshold value, continuing to match, and discarding the current frame track when the current frame quantization score is smaller than the preset threshold value;
and carrying out the operation on the next frame of image until the video is finished.
Preferably, the calculating the current frame detection frame and the corresponding current frame appearance embedded vector according to the pedestrian feature includes:
inputting the pedestrian characteristics into a channel enhancement characteristic re-planning module which is trained by an Adam algorithm together with the overall model, and adaptively dividing the pedestrian characteristics into detection subtask characteristics and tracking subtask characteristics;
and respectively calculating the current frame detection frame and the corresponding current frame appearance embedded vector according to the detection subtask characteristics and the tracking subtask characteristics.
Preferably, inputting the pedestrian feature into a channel-enhanced feature re-planning module trained jointly with the overall model by the Adam algorithm, and adaptively dividing the pedestrian feature into a detection subtask feature and a tracking subtask feature, comprises:
inputting the pedestrian feature into the channel-enhanced feature re-planning module;
passing the pedestrian feature F_t ∈ R^{H×W×C} through two point-wise convolutions to obtain a first feature F_q ∈ R^{H×W×1} and a second feature F_v ∈ R^{H×W×rC};
applying a softmax function to the first feature F_q and matrix-multiplying the result with the second feature F_v to obtain a pedestrian feature vector V_cha containing global and channel-dimension information;
passing the pedestrian feature vector V_cha through two parallel groups of convolution-normalization-ReLU-channel-shuffle operations to obtain the detection subtask feature vector V_det and the tracking subtask feature vector V_id respectively;
passing the pedestrian feature F_t ∈ R^{H×W×C} through a residual branch of convolution-normalization-ReLU-channel-shuffle operations in which the number of input channels is expanded r times, to obtain a reconstructed pedestrian feature F′;
broadcasting the detection subtask feature vector V_det and the tracking subtask feature vector V_id and multiplying them element-wise with the reconstructed pedestrian feature F′ to obtain the detection subtask feature F_det and the tracking subtask feature F_id required by the respective subtasks.
Preferably, the calculating the current frame detection frame according to the detection subtask feature includes:
the detection subtask includes a heatmap branch, an offset branch, and a size branch;
convolving the detection subtask feature and applying a ReLU activation to obtain a heatmap tensor O_heatmap, an offset tensor O_offset, and a size tensor O_size;
calculating the current frame detection frames D_i, i ∈ [1, …, M] from O_heatmap, O_offset, and O_size, where M is the number of detection frames in the current frame.
Preferably, the calculating the current frame appearance embedded vector according to the tracking subtask feature includes:
the tracking subtask includes an appearance-embedding branch;
convolving the tracking subtask feature and applying a ReLU activation to obtain an appearance-embedding tensor O_id;
extracting the current frame appearance embedded vectors ED_i, i ∈ [1, …, M] from O_id at the positions corresponding to the center points of the current frame detection frames D_i.
Preferably, if the matching is successful, calculating the current frame track quantization score according to the current frame detection frame confidence and the previous frame track quantization score, and updating the appearance embedding, includes:
when the matching is successful, the track quantization score is updated as ST_t = f(ST_{t-1}, SD_t), whose closed form is given only as an image in the original publication;
the appearance embedding is updated as ET_t = (α - δ) × ET_{t-1} + (1 - α + δ) × ED_t,
where ST is the track quantization score, SD is the detection frame confidence, ET and ED are the appearance embeddings of the track and the detection frame respectively, δ (likewise given only as an image) is an influence factor computed from the detection frame confidence, and α is a constant.
Preferably, the marking the current frame detection frame with failed matching as a new inactive state track, resetting the current frame track state according to the current frame track quantization score and a preset threshold value, continuing matching, and discarding the current frame track when the current frame quantization score is smaller than the preset threshold value comprises:
marking the current frame detection frame with failed matching as a new inactive state track;
when the quantization score of the inactive state track is greater than or equal to a first preset threshold value, the inactive state track is activated to be an active state track;
when the matching of the active state track in a plurality of frames fails, but the quantization score is not lower than a second preset threshold value, the state is changed into a temporary lost state to continue the matching, and if the temporary lost state track is successfully matched again, the state is changed into the active state from the temporary lost state;
and when the quantization score of the temporarily lost track is smaller than the second preset threshold value, the target is considered to have disappeared from the video sequence, and the track is changed to a discarded state, no longer participating in subsequent matching.
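For illustration, the track life cycle just described can be sketched as a small state machine in Python; the state and function names are illustrative, and the thresholds θ1 = 0.5 and θ2 = 0.15 are taken from the embodiment described later:

```python
from enum import Enum, auto

class TrackState(Enum):
    INACTIVE = auto()   # newly created from an unmatched detection frame
    ACTIVE = auto()     # confirmed track taking part in matching
    LOST = auto()       # temporarily lost but still matched against
    DISCARDED = auto()  # removed from all subsequent matching

THETA1, THETA2 = 0.5, 0.15  # first/second preset thresholds from the embodiment

def next_state(state: TrackState, score: float, matched: bool) -> TrackState:
    """One transition of the track life cycle driven by the quantization score."""
    if state is TrackState.INACTIVE and score >= THETA1:
        return TrackState.ACTIVE            # quality high enough: activate
    if state is TrackState.ACTIVE and not matched:
        return TrackState.LOST              # keep matching while temporarily lost
    if state is TrackState.LOST and matched:
        return TrackState.ACTIVE            # re-matched: back to active
    if state is TrackState.LOST and score < THETA2:
        return TrackState.DISCARDED         # target assumed gone from the sequence
    return state
```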
The invention also provides a device for multi-target tracking, which comprises:
the feature extraction module is used for acquiring an original frame image of the video, inputting the original frame image into a backbone network and outputting pedestrian features;
the target detection and appearance extraction module is used for calculating and obtaining a current frame detection frame and a corresponding current frame appearance embedded vector according to the pedestrian characteristics;
the matching association module is used for judging whether the current frame image is a first frame or not, and if the current frame image is not the first frame, matching and associating a current frame detection frame and the corresponding current frame appearance embedded vector with a previous frame track and an appearance embedded vector updated along with the previous frame track;
the track updating module is used for calculating the track quantization score of the current frame according to the confidence coefficient of the current frame detection frame and the track quantization score of the previous frame if the matching is successful, embedding the appearance of the current frame for updating, judging the track state, marking the track of the activation state which is successfully matched as a tracking state, and marking the track state which is temporarily lost and successfully matched as an activation state;
the matching failure track updating module is used for subtracting a preset constant from the previous frame track quantization score to obtain the current frame track quantization score if the matching fails, embedding the current frame appearance into the current frame without updating, marking the current frame detection frame which fails to match as a new inactive state track, resetting the current frame track state according to the current frame track quantization score and a preset threshold value, continuing to match, and discarding the current frame track when the current frame quantization score is smaller than the preset threshold value;
and the ending judgment module is used for carrying out the operation on the next frame of image until the video ends.
The present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of a multi-target tracking method as described above.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the invention evaluates the quality of each track, adopts the quantization score to explicitly represent the quality of each track, dynamically updates the quantization score in each frame according to the difference of the matching results, and then determines the state of the track according to the updated track quantization score. The invention considers the diversity of the track, introduces the quality of the track and the detection result into the updating of the track, prolongs the duration of the high-quality track, increases the possibility of matching with the subsequent detection frame, and compared with the track with low quality, the track with low quality can be terminated in a shorter time, thereby reducing the interference to the matching link. Through the strategy, identity switching in matching is effectively reduced, so that tracking accuracy and target identity robustness are improved.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings, in which:
FIG. 1 is a flow chart of an implementation of the multi-target tracking method of the present invention;
FIG. 2 is a schematic diagram of the channel-enhanced feature re-planning (CEFR) module;
FIG. 3 is a diagram of an algorithm model of the present invention;
FIG. 4 is a flow chart of one embodiment of the present invention;
fig. 5 is a block diagram of a multi-target tracking apparatus according to an embodiment of the present invention.
Detailed Description
The core of the invention is to provide a method, a device and a computer storage medium for scoring tracks, prolonging the duration of high-quality tracks and improving the detection precision.
In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a flowchart of an implementation of a multi-target tracking method provided by the present invention; the specific operation steps are as follows:
s101, acquiring an original frame image of a video, inputting the original frame image into a backbone network, and outputting pedestrian characteristics;
acquisition of RGB frame I t : frame taking processing is carried out on the video to obtain an original RGB frame I t Wherein t is [1, …, N];
Extracting the features of the RGB frame: will original RGB frame I t Input to backbone network, output feature F t
The backbone network adopts DLA-34, and other convolutional neural networks can also be adopted.
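A minimal sketch of this step, assuming the DLA-34 implementation shipped with the timm library; the model name "dla34", the features_only interface, and the input size are assumptions of this sketch, not prescribed by the patent:

```python
import timm
import torch

# Hypothetical setup: timm's DLA-34 as the backbone, returning feature maps.
backbone = timm.create_model("dla34", pretrained=True, features_only=True)
backbone.eval()

frame = torch.randn(1, 3, 608, 1088)   # one RGB frame I_t, resized as in the experiments
with torch.no_grad():
    feature_maps = backbone(frame)     # list of multi-scale feature maps
f_t = feature_maps[-1]                 # deepest map stands in for the pedestrian feature F_t
```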
S102, calculating to obtain a current frame detection frame and a corresponding current frame appearance embedded vector according to the pedestrian characteristics;
S103, judging whether the current frame image is the first frame; if it is not the first frame, i.e. t ≠ 1, matching and associating the current frame detection frames D_i, i ∈ [1, M] and the corresponding current frame appearance embedded vectors ED_i, i ∈ [1, M] with the previous frame tracks T_{t-1,j}, j ∈ [1, K] and the appearance embedded vectors ET_{t-1,j}, j ∈ [1, K] updated along with them;
where M is the number of current frame detection frames and K is the number of previous frame tracks; matched detection frames are used to update the corresponding tracks.
If the current frame image is the first frame, track initialization is performed: the current frame track quantization score is set equal to the confidence of the current frame detection frame, ST_t = SD_t, and the current frame appearance embedding is set to the appearance embedding corresponding to the current frame detection frame, ET_t = ED_t.
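The patent does not fix the assignment algorithm for this matching step; the sketch below uses a common choice in joint-detection-and-tracking methods, Hungarian matching over cosine distances between track embeddings ET and detection embeddings ED, with max_dist as an assumed gating parameter:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_embs: np.ndarray, det_embs: np.ndarray, max_dist: float = 0.4):
    """Match K track embeddings (ET) to M detection embeddings (ED).

    Hungarian assignment on cosine distance; pairs above max_dist stay unmatched.
    """
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    cost = 1.0 - t @ d.T                                  # K x M cosine distances
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    unmatched_tracks = sorted(set(range(len(track_embs))) - {r for r, _ in matches})
    unmatched_dets = sorted(set(range(len(det_embs))) - {c for _, c in matches})
    return matches, unmatched_tracks, unmatched_dets
```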
S104, if the matching is successful, calculating the track quantization score of the current frame according to the confidence coefficient of the detection frame of the current frame and the track quantization score of the previous frame, embedding the appearance of the current frame for updating, judging the track state, marking the track of the activation state which is successfully matched as a tracking state, and marking the track state which is temporarily lost and successfully matched as an activation state;
When the matching is successful, the track quantization score is updated as ST_t = f(ST_{t-1}, SD_t), whose closed form is given only as an image in the original publication;
the appearance embedding is updated as ET_t = (α - δ) × ET_{t-1} + (1 - α + δ) × ED_t,
where ST is the track quantization score, SD is the detection frame confidence, and ET and ED are the appearance embeddings of the track and the detection frame respectively; δ (likewise given only as an image), an influence factor computed from the detection frame confidence, gives high-quality detection appearances a higher embedding weight, and α is a constant.
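As a worked illustration of this confidence-weighted moving average: in the sketch below δ is passed in as a value, since its closed form appears only as an image in the original, and α = 0.9 is an assumed illustrative constant:

```python
import numpy as np

def update_appearance(et_prev: np.ndarray, ed_cur: np.ndarray,
                      alpha: float, delta: float) -> np.ndarray:
    """ET_t = (alpha - delta) * ET_{t-1} + (1 - alpha + delta) * ED_t."""
    return (alpha - delta) * et_prev + (1.0 - alpha + delta) * ed_cur

# With alpha = 0.9, a high-confidence detection yielding delta = 0.1 doubles the
# weight on the current appearance ED_t (0.2 instead of 0.1), i.e. exactly the
# "higher embedding weight for high-quality detections" behaviour described above.
```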
S105, if the matching fails, the previous frame track quantization score is reduced by a preset constant to obtain the current frame track quantization score, ST_t = ST_{t-1} - C_lost; the appearance embedding is not updated. The current frame detection frames that failed to match are marked as new inactive tracks, the current frame track states are reset according to the current frame track quantization scores and the preset thresholds to continue matching, and the current frame track is discarded when its quantization score is smaller than the preset threshold;
C_lost is a constant; in this embodiment C_lost = 0.03;
the current frame detection frame that failed to match is marked as a new inactive track;
when the quantization score of an inactive track is greater than or equal to a first preset threshold θ1, the track is activated to an active track; in this embodiment θ1 = 0.5;
when an active track fails to match over several frames but its quantization score is not lower than a second preset threshold θ2, its state is changed to temporarily lost and it continues to participate in matching; if a temporarily lost track is matched again, its state changes from temporarily lost back to active; in this embodiment θ2 = 0.15;
when the quantization score of a temporarily lost track is smaller than the second preset threshold, the target is considered to have disappeared from the video sequence and the track is changed to a discarded state, no longer participating in subsequent matching.
And S106, carrying out the operation on the next frame of image until the video is finished.
If t = N, the process terminates; otherwise t = t + 1 and the next frame is processed from step S101.
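Putting steps S103-S105 together, one frame of the dynamic track quantization strategy can be sketched as follows. Here next_state, update_appearance, and TrackState refer to the earlier sketches; update_score stands in for the matched-case ST update (given only as an image in the original), and delta, new_track, and the track/detection field names are hypothetical placeholders:

```python
C_LOST = 0.03   # per-frame score penalty C_lost used in this embodiment
ALPHA = 0.9     # assumed constant for the appearance update

def process_frame(tracks, detections, matches, unmatched_tracks, unmatched_dets):
    """One iteration of steps S103-S105 (sketch)."""
    for ti, di in matches:                      # successful matches
        trk, det = tracks[ti], detections[di]
        trk.score = update_score(trk.score, det.confidence)   # ST_t from ST_{t-1}, SD_t
        trk.emb = update_appearance(trk.emb, det.emb, ALPHA, delta(det.confidence))
        trk.state = next_state(trk.state, trk.score, matched=True)
    for ti in unmatched_tracks:                 # failed matches: penalize, ED not updated
        trk = tracks[ti]
        trk.score -= C_LOST                     # ST_t = ST_{t-1} - C_lost
        trk.state = next_state(trk.state, trk.score, matched=False)
    for di in unmatched_dets:                   # leftover detections start inactive tracks
        tracks.append(new_track(detections[di], state=TrackState.INACTIVE))
    tracks[:] = [t for t in tracks if t.state is not TrackState.DISCARDED]
```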
The data-association module receives little attention in multi-target tracking: current mainstream algorithms pursue more accurate detection results and more discriminative appearance embeddings but ignore the diversity of tracks, processing tracks of different quality in the same way, so that low-quality tracks interfere with the matching stage and high-quality tracks cannot take part in more rounds of matching.
Therefore, the invention designs a dynamic track quantization strategy, and adopts a quantization score to explicitly represent the quality of the track. Different mechanisms are dynamically employed to update the quantization score, state, and appearance embedding of the track based on the results of each frame match.
The dynamic track quantization strategy provided by the invention considers the diversity of tracks and introduces track quality and detection results into the track update, prolonging the lifetime of high-quality tracks and increasing their chances of matching subsequent detection frames; low-quality tracks, by contrast, are terminated in a shorter time, reducing interference with the matching stage. Through this strategy, identity switches during matching are effectively reduced, improving tracking accuracy and the robustness of target identities.
As shown in fig. 2, based on the above embodiment, the present embodiment further describes step S102 in detail, specifically as follows:
S201, inputting the pedestrian feature into a channel-enhanced feature re-planning module trained jointly with the overall model by the Adam algorithm, which adaptively divides the pedestrian feature into a detection subtask feature and a tracking subtask feature;
the pedestrian feature is input into the channel-enhanced feature re-planning module;
the pedestrian feature F_t ∈ R^{H×W×C} passes through two point-wise convolutions to obtain a first feature F_q ∈ R^{H×W×1} and a second feature F_v ∈ R^{H×W×rC};
a softmax function is applied to the first feature F_q, and the result is matrix-multiplied with the second feature F_v to obtain a pedestrian feature vector V_cha containing global and channel-dimension information; here r = 2 in the dimension of F_v;
the pedestrian feature vector V_cha passes through two parallel groups of convolution-normalization-ReLU-channel-shuffle operations to obtain the detection subtask feature vector V_det and the tracking subtask feature vector V_id respectively;
the pedestrian feature F_t ∈ R^{H×W×C} passes through a residual branch of convolution-normalization-ReLU-channel-shuffle operations in which the number of input channels is expanded r times (i.e., to rC, where C is the original channel count), giving the reconstructed pedestrian feature F′;
the detection subtask feature vector V_det and the tracking subtask feature vector V_id are broadcast and multiplied element-wise with the reconstructed pedestrian feature F′ to obtain the detection subtask feature F_det and the tracking subtask feature F_id required by the respective subtasks.
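A PyTorch sketch of the module as described above follows; kernel sizes, the use of BatchNorm, the two-group channel shuffle, and the exact placement of the residual connection are assumptions where the text leaves them open:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Interleave channel groups to strengthen inter-channel interaction."""
    b, c = x.shape[:2]
    return x.view(b, groups, c // groups, *x.shape[2:]).transpose(1, 2).reshape_as(x)

class CEFR(nn.Module):
    """Channel-enhanced feature re-planning (sketch), with r = 2 as in the embodiment."""

    def __init__(self, c: int, r: int = 2):
        super().__init__()
        self.to_q = nn.Conv2d(c, 1, 1)           # point-wise conv -> F_q (H x W x 1)
        self.to_v = nn.Conv2d(c, r * c, 1)       # point-wise conv -> F_v (H x W x rC)
        def branch():                            # conv - norm - ReLU (shuffle applied after)
            return nn.Sequential(nn.Conv1d(r * c, r * c, 1),
                                 nn.BatchNorm1d(r * c), nn.ReLU())
        self.det_branch, self.id_branch = branch(), branch()
        self.expand = nn.Conv2d(c, r * c, 1)     # channel expansion C -> rC
        self.res_block = nn.Sequential(nn.Conv2d(r * c, r * c, 3, padding=1),
                                       nn.BatchNorm2d(r * c), nn.ReLU())

    def forward(self, f_t: torch.Tensor):
        b, c, h, w = f_t.shape
        q = F.softmax(self.to_q(f_t).reshape(b, 1, h * w), dim=-1)  # attention over positions
        v = self.to_v(f_t).reshape(b, -1, h * w)                    # b x rC x HW
        v_cha = torch.bmm(v, q.transpose(1, 2))                     # b x rC x 1 global channel vector
        v_det = channel_shuffle(self.det_branch(v_cha))             # re-planning weights (detection)
        v_id = channel_shuffle(self.id_branch(v_cha))               # re-planning weights (tracking)
        f_e = self.expand(f_t)
        f_prime = f_e + channel_shuffle(self.res_block(f_e))        # reconstructed feature F'
        f_det = f_prime * v_det.unsqueeze(-1)                       # broadcast element-wise multiply
        f_id = f_prime * v_id.unsqueeze(-1)
        return f_det, f_id

# Example usage: f_det, f_id = CEFR(c=64)(torch.randn(1, 64, 152, 272))
# returns two rC-channel maps with unchanged spatial size, as the text requires.
```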
S202, respectively calculating the current frame detection frame and the corresponding current frame appearance embedded vector according to the detection subtask features and the tracking subtask features.
The detection subtask includes a heatmap branch, an offset branch, and a size branch;
the detection subtask feature is convolved and passed through a ReLU activation to obtain the heatmap tensor O_heatmap, the offset tensor O_offset, and the size tensor O_size;
the current frame detection frames D_i, i ∈ [1, …, M] are calculated from O_heatmap, O_offset, and O_size, where M is the number of detection frames in the current frame;
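The step from the three output tensors to boxes is not spelled out beyond the branch definitions; a standard CenterNet-style decode, shown below as one plausible realization (the top-k and threshold values are assumptions), recovers boxes from heatmap peaks plus the offset and size maps:

```python
import torch
import torch.nn.functional as F

def decode_detections(heatmap, offset, size, k=100, thresh=0.4):
    """Decode boxes D_i from O_heatmap (1x1xHxW) and O_offset / O_size (1x2xHxW)."""
    h, w = heatmap.shape[2:]
    # 3x3 max-pool NMS: keep only local maxima of the center heatmap
    peaks = heatmap * (heatmap == F.max_pool2d(heatmap, 3, stride=1, padding=1))
    scores, idx = peaks.view(-1).topk(k)
    keep = scores > thresh
    scores, idx = scores[keep], idx[keep]
    ys = torch.div(idx, w, rounding_mode="floor")
    xs = idx % w
    cx = xs.float() + offset[0, 0, ys, xs]        # sub-pixel center x
    cy = ys.float() + offset[0, 1, ys, xs]        # sub-pixel center y
    bw, bh = size[0, 0, ys, xs], size[0, 1, ys, xs]
    boxes = torch.stack([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], dim=1)
    return boxes, scores, (xs, ys)                # centers reused to read out ED_i
```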
The tracking subtask includes an appearance-embedding branch;
the tracking subtask feature is convolved and passed through a ReLU activation to obtain the appearance-embedding tensor O_id;
the current frame appearance embedded vectors ED_i, i ∈ [1, …, M] are extracted from O_id at the positions corresponding to the center points of the current frame detection frames D_i.
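Reading the embeddings out of O_id at the detected centers is then a simple gather; the L2 normalization below is an assumption made so the vectors are directly usable for cosine matching:

```python
import torch.nn.functional as F

def extract_embeddings(o_id, centers):
    """Gather one appearance vector ED_i per detection from O_id (1 x E x H x W)."""
    xs, ys = centers
    emb = o_id[0, :, ys, xs].T          # M x E, one row per detection frame D_i
    return F.normalize(emb, dim=1)      # unit norm (assumed) for cosine matching
```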
The channel-enhanced feature re-planning (CEFR) module provided by the invention re-plans the features output by the backbone network into two task-specific features. During model training, the detection subtask features become more sensitive to position information and the tracking subtask features more sensitive to identity information, which reduces interference between the two subtasks, relieves their optimization conflict, improves the suitability of the features, and yields more accurate detection results and more discriminative appearance embeddings. In addition, the module establishes globally dense connections along the channel dimension, enriching multi-scale channel information and making up for the lack of long-range dependence, and performs a channel-shuffle operation to strengthen interaction between channels. The module thus alleviates, to a certain extent, the optimization conflict between the subtasks of the joint-detection-and-tracking paradigm, and supplements the subtask features with inter-channel information and long-range dependencies. Its input and output sizes are unchanged, so it can be used in other algorithms of the same paradigm.
Based on the above examples, in order to verify the accuracy and robustness of the present invention, experiments were performed on the disclosed MOT17 and MOT20 data sets, specifically as follows:
the MOT17 data set comprises 14 video sequences and 1342 tracks, wherein interference factors such as different camera angles, different weather conditions, different camera movements and the like exist, and the crowd density distribution is balanced. The detection result of MOT17 is obtained by detecting three different detectors of DPM, SDP and FasterR-CNN.
MOT20 is a newer dataset containing 8 video sequences, about 13,400 frames in total. Its number of tracks is similar to MOT17, but its crowd density is almost three times higher, making it a dense-scene benchmark that is more challenging for the algorithm. Interference factors such as different camera angles are likewise present.
The experiments comprise two parts: online testing on the test set and offline validation on a held-out split. Online testing:
Experimental settings: training on MOT17 ran for 20 epochs, input images were resized to 1088×608, and the learning rate started at 0.0001 and dropped to 0.00001 for the last 10 epochs. Of the 14 video sequences, half were used for training and half for testing. Training on MOT20 ran for 15 epochs, with a learning rate of 0.0001 for the first 10 epochs and 0.00001 for the last 5; 4 video sequences were used for training and 4 for testing. The backbone network is DLA-34.
Table 1. MOTA test results on MOT17 and MOT20

Method  | MOT17 | MOT20
FairMOT | 73.7% | 61.8%
Ours    | 75.2% | 63.9%
As Table 1 shows, the method of the present invention achieves higher MOTA on both the MOT17 and MOT20 datasets. Although both datasets involve occlusion, cluttered backgrounds, and viewpoint changes, and the crowd density of MOT20 is high, the proposed method is robust to these difficulties and therefore achieves a clear improvement.
Offline verification:
setting experimental parameters: for 7 video sequences for training of MOT17, the first half of the frame of each sequence was taken as the training set for the validation experiment and the second half as the validation set for the validation experiment. The training round is 20 on the newly divided training set, the input picture size is adjusted to 1088×608, the learning rate is 0.0001 at the beginning, and the last 10 rounds are reduced to 0.00001. The backbone network employs DLA-34.
The proposed method comprises two parts: the dynamic track quantization strategy and the channel-enhanced feature re-planning (CEFR) module. As Table 2 shows, the baseline network FairMOT reaches a MOTA of 71.1% on the MOT17 validation set; adding only the dynamic track quantization strategy raises MOTA to 71.5%, adding only the CEFR module raises it to 72.7%, and adding both to FairMOT yields a final MOTA of 73.4%. This shows that both mechanisms benefit multi-target tracking performance: the dynamic track quantization strategy treats tracks of different quality more fairly, the CEFR module relieves the optimization conflict between subtasks, and both effectively improve tracking accuracy.
Table 2. Ablation results on the MOT17 validation set (MOTA)

Method                                               | MOT17 validation set
FairMOT                                              | 71.1%
FairMOT + CEFR module                                | 72.7%
FairMOT + dynamic track quantization strategy        | 71.5%
FairMOT + dynamic track quantization strategy + CEFR | 73.4%
As shown in FIGS. 3 and 4, the algorithm improves on FairMOT. With RGB frames as input, the model comprises four key parts: (1) feature extraction; (2) the channel-enhanced feature re-planning (CEFR) module; (3) the two subtask branches for detection and tracking; and (4) the data-association module. The detection subtask has three branches (heatmap, offset, and detection frame size), while the tracking subtask has a single appearance-embedding branch. The invention designs a dynamic track quantization strategy that explicitly represents track quality, highlights the differences between tracks of different quality, treats them more fairly, and reduces the interference of low-quality tracks with the matching stage. The invention also constructs a channel-enhanced feature re-planning module that focuses on channel-dimension information and inter-channel interaction, provides the network with long-range dependence, and relieves the optimization conflict between subtasks to a certain extent. Compared with existing multi-target tracking methods, the method achieves higher tracking accuracy and maintains target identities more robustly.
In summary, the invention discloses a multi-target tracking method based on a dynamic track quantization strategy and feature re-planning, built on a multi-target tracking framework of the joint-detection-and-tracking paradigm. Because most algorithms initialize every unmatched detection frame as a new track, terminate any track unmatched beyond a threshold number of frames, and ignore the differences between tracks of different quality when handling track birth and termination, the invention provides a dynamic track quality quantization strategy that explicitly characterizes the quality of each track with a dynamically updated score and applies different update mechanisms according to the matching result. In addition, aiming at the conflict between the detection and tracking subtasks in joint-detection-and-tracking models, the invention designs a channel-enhanced feature re-planning module that drives the two subtasks to learn distinct features, improves the suitability of the features, and provides more accurate detection results for the dynamic track quantization strategy.
Referring to fig. 5, fig. 5 is a block diagram of a multi-target tracking device according to an embodiment of the present invention; the specific apparatus may include:
the feature extraction module 100 is used for acquiring an original frame image of a video, inputting the original frame image into a backbone network, and outputting pedestrian features;
the target detection and appearance extraction module 200 is configured to calculate a current frame detection frame and a corresponding current frame appearance embedded vector according to the pedestrian characteristics;
the matching association module 300 is configured to determine whether the current frame image is a first frame, and if the current frame image is not the first frame, match and associate a current frame detection frame and a corresponding appearance embedded vector of the current frame with a previous frame track and an appearance embedded vector updated along with the previous frame track;
the successful matching track updating module 400 is configured to calculate a current frame track quantization score according to a current frame detection frame confidence and a previous frame track quantization score if the matching is successful, embed the appearance of the current frame for updating, judge a track state, mark a successful matching activation state track as a tracking state, and mark a successful matching temporary loss track state as an activation state;
the matching failure track updating module 500 is configured to subtract a preset constant from the previous frame track quantization score to obtain the current frame track quantization score if the matching fails, leave the current frame appearance embedding not updated, mark the current frame detection frame that failed to match as a new inactive state track, reset the current frame track state according to the current frame track quantization score and a preset threshold to continue matching, and discard the current frame track when the current frame quantization score is smaller than the preset threshold;
and the ending judgment module 600 is used for carrying out the operation on the next frame of image until the video ends.
The multi-target tracking apparatus of this embodiment is used to implement the multi-target tracking method described above, so the specific implementation of each module can be found in the corresponding method steps. For example, the feature extraction module 100, the target detection and appearance extraction module 200, the matching association module 300, the matching-success track updating module 400, the matching-failure track updating module 500, and the ending judgment module 600 implement steps S101, S102, S103, S104, S105, and S106 of the method respectively; their details therefore refer to the descriptions of those steps and are not repeated here.
The specific embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the steps of the multi-target tracking method when being executed by a processor.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is apparent that the above examples are given by way of illustration only and do not limit the embodiments. Other variations and modifications will be apparent to those of ordinary skill in the art in light of the foregoing description; it is neither necessary nor possible to enumerate all embodiments here. Any obvious variations or modifications derived from them remain within the scope of the invention.

Claims (8)

1. A multi-target tracking method, comprising:
the method comprises the steps of obtaining an original frame image of a video, inputting the original frame image into a backbone network, and outputting pedestrian characteristics;
calculating according to the pedestrian characteristics to obtain a current frame detection frame and a corresponding current frame appearance embedded vector;
judging whether the current frame image is a first frame or not, if the current frame image is not the first frame, matching and associating a current frame detection frame and the corresponding current frame appearance embedded vector with a previous frame track and an appearance embedded vector updated along with the previous frame track;
if the matching is successful, calculating the track quantization score of the current frame according to the confidence coefficient of the detection frame of the current frame and the track quantization score of the previous frame, embedding the appearance of the current frame for updating, judging the track state, marking the track of the activation state which is successfully matched as a tracking state, and marking the track state which is temporarily lost and successfully matched as an activation state;
if the matching fails, subtracting a preset constant from the previous frame track quantization score to obtain the current frame track quantization score, wherein the appearance embedding is not updated, marking the current frame detection frame with failed matching as a new inactive state track, resetting the current frame track state according to the current frame track quantization score and a preset threshold value, continuing to match, and discarding the current frame track when the current frame quantization score is smaller than the preset threshold value;
marking the current frame detection frame with failed matching as a new inactive state track; when the quantization score of the inactive state track is greater than or equal to a first preset threshold value, the inactive state track is activated to an active state track; when the matching of the active state track fails over a plurality of frames but its quantization score is not lower than a second preset threshold value, its state is changed to a temporarily lost state and matching continues, and if the temporarily lost track is successfully matched again, its state changes from temporarily lost back to active; when the quantization score of the temporarily lost track is smaller than the second preset threshold value, the target is considered to have disappeared from the video sequence and the track is changed to a discarded state, no longer participating in subsequent matching;
performing the above operation on the next frame of image until the video is finished;
the step of calculating the current frame track quantization score according to the current frame detection frame confidence and the previous frame track quantization score, and updating the appearance embedding, comprises:
when the matching is successful, the track quantization score is updated as ST_t = f(ST_{t-1}, SD_t), whose closed form is given only as an image in the original publication;
the appearance embedding is updated as ET_t = (α - δ) × ET_{t-1} + (1 - α + δ) × ED_t,
where ST is the track quantization score, SD is the detection frame confidence, ET and ED are the appearance embeddings of the track and the detection frame respectively, δ (likewise given only as an image) is an influence factor computed from the detection frame confidence, and α is a constant.
2. The multi-target tracking method of claim 1, wherein the calculating a current frame detection frame and a corresponding current frame appearance embedding vector from the pedestrian feature comprises:
inputting the pedestrian characteristics into a channel enhancement characteristic re-planning module which is trained by an Adam algorithm together with the overall model, and adaptively dividing the pedestrian characteristics into detection subtask characteristics and tracking subtask characteristics;
and respectively calculating the current frame detection frame and the corresponding current frame appearance embedded vector according to the detection subtask characteristics and the tracking subtask characteristics.
3. The multi-target tracking method according to claim 2, wherein the step of inputting the pedestrian feature into a channel-enhanced feature re-planning module trained jointly with the overall model by the Adam algorithm, and adaptively dividing the pedestrian feature into a detection subtask feature and a tracking subtask feature, comprises:
inputting the pedestrian feature into the channel-enhanced feature re-planning module;
passing the pedestrian feature F_t ∈ R^{H×W×C} through two point-wise convolutions to obtain a first feature F_q ∈ R^{H×W×1} and a second feature F_v ∈ R^{H×W×rC};
applying a softmax function to the first feature F_q and matrix-multiplying the result with the second feature F_v to obtain a pedestrian feature vector V_cha containing global and channel-dimension information;
passing the pedestrian feature vector V_cha through two parallel groups of convolution-normalization-ReLU-channel-shuffle operations to obtain the detection subtask feature vector V_det and the tracking subtask feature vector V_id respectively;
passing the pedestrian feature F_t ∈ R^{H×W×C} through a residual branch of convolution-normalization-ReLU-channel-shuffle operations in which the number of input channels is expanded r times, to obtain a reconstructed pedestrian feature F′;
broadcasting the detection subtask feature vector V_det and the tracking subtask feature vector V_id and multiplying them element-wise with the reconstructed pedestrian feature F′ to obtain the detection subtask feature F_det and the tracking subtask feature F_id required by the respective subtasks.
4. The multi-target tracking method of claim 2, wherein said calculating said current frame detection frame from said detection subtask feature comprises:
the detection subtask includes a heatmap branch, an offset branch, and a size branch;
convolving the detection subtask feature and applying a ReLU activation to obtain a heatmap tensor O_heatmap, an offset tensor O_offset, and a size tensor O_size;
calculating the current frame detection frames D_i, i ∈ [1, …, M] from O_heatmap, O_offset, and O_size, where M is the number of detection frames in the current frame.
5. The multi-target tracking method of claim 4, wherein said computing the current frame appearance embedding vector from the tracking subtask feature comprises:
the tracking subtask includes an appearance-embedding branch;
convolving the tracking subtask feature and applying a ReLU activation to obtain an appearance-embedding tensor O_id;
extracting the current frame appearance embedded vectors ED_i, i ∈ [1, …, M] from O_id at the positions corresponding to the center points of the current frame detection frames D_i.
6. The multi-target tracking method according to claim 1, wherein the determining whether the current frame image is the first frame comprises:
if the current frame image is the first frame, track initialization is performed: the current frame track quantization score is set equal to the confidence of the current frame detection frame, ST_t = SD_t, and the current frame appearance embedding is set to the appearance embedding corresponding to the current frame detection frame, ET_t = ED_t.
7. A multi-target tracking apparatus, comprising:
the feature extraction module is used for acquiring an original frame image of the video, inputting the original frame image into a backbone network and outputting pedestrian features;
the target detection and appearance extraction module is used for calculating and obtaining a current frame detection frame and a corresponding current frame appearance embedded vector according to the pedestrian characteristics;
the matching association module is used for judging whether the current frame image is a first frame or not, and if the current frame image is not the first frame, matching and associating a current frame detection frame and the corresponding current frame appearance embedded vector with a previous frame track and an appearance embedded vector updated along with the previous frame track;
the track updating module is used for calculating the current frame track quantization score according to the current frame detection frame confidence and the previous frame track quantization score if the matching is successful, updating the appearance embedding of the current frame, judging the track state, marking a successfully matched active state track as a tracking state, and marking a successfully matched temporarily lost track as an active state; the step of calculating the current frame track quantization score according to the current frame detection frame confidence and the previous frame track quantization score, and updating the appearance embedding, comprises:
when the matching is successful, the track quantization score is updated as ST_t = f(ST_{t-1}, SD_t), whose closed form is given only as an image in the original publication;
the appearance embedding is updated as ET_t = (α - δ) × ET_{t-1} + (1 - α + δ) × ED_t,
where ST is the track quantization score, SD is the detection frame confidence, ET and ED are the appearance embeddings of the track and the detection frame respectively, δ (likewise given only as an image) is an influence factor computed from the detection frame confidence, and α is a constant;
the matching failure track updating module is used for subtracting a preset constant from the previous frame track quantization score to obtain the current frame track quantization score if the matching fails, with the current frame appearance embedding not updated, marking the current frame detection frame that failed to match as a new inactive state track, resetting the current frame track state according to the current frame track quantization score and a preset threshold value to continue matching, and discarding the current frame track when the current frame quantization score is smaller than the preset threshold value; when the quantization score of the inactive state track is greater than or equal to a first preset threshold value, the inactive state track is activated to an active state track; when the matching of the active state track fails over a plurality of frames but its quantization score is not lower than a second preset threshold value, its state is changed to a temporarily lost state and matching continues, and if the temporarily lost track is successfully matched again, its state changes from temporarily lost back to active; when the quantization score of the temporarily lost track is smaller than the second preset threshold value, the target is considered to have disappeared from the video sequence and the track is changed to a discarded state, no longer participating in subsequent matching;
and the ending judgment module is used for carrying out the operation on the next frame of image until the video ends.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of a multi-object tracking method according to any of claims 1 to 6.
CN202210343596.XA 2022-04-02 2022-04-02 Multi-target tracking method for dynamic track quality quantification and feature re-planning Active CN114972417B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210343596.XA CN114972417B (en) 2022-04-02 2022-04-02 Multi-target tracking method for dynamic track quality quantification and feature re-planning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210343596.XA CN114972417B (en) 2022-04-02 2022-04-02 Multi-target tracking method for dynamic track quality quantification and feature re-planning

Publications (2)

Publication Number Publication Date
CN114972417A (en) 2022-08-30
CN114972417B (en) 2023-06-30

Family

ID=82976823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210343596.XA Active CN114972417B (en) 2022-04-02 2022-04-02 Multi-target tracking method for dynamic track quality quantification and feature re-planning

Country Status (1)

Country Link
CN (1) CN114972417B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190023389A (en) * 2017-08-29 2019-03-08 인하대학교 산학협력단 Multi-Class Multi-Object Tracking Method using Changing Point Detection
CN114255434A (en) * 2022-03-01 2022-03-29 深圳金三立视频科技股份有限公司 Multi-target tracking method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019006632A1 (en) * 2017-07-04 2019-01-10 深圳大学 Video multi-target tracking method and device
CN111709974B (en) * 2020-06-22 2022-08-02 苏宁云计算有限公司 Human body tracking method and device based on RGB-D image
CN113592902A (en) * 2021-06-21 2021-11-02 北京迈格威科技有限公司 Target tracking method and device, computer equipment and storage medium
CN114241007B (en) * 2021-12-20 2022-08-05 江南大学 Multi-target tracking method based on cross-task mutual learning, terminal equipment and medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190023389A (en) * 2017-08-29 2019-03-08 인하대학교 산학협력단 Multi-Class Multi-Object Tracking Method using Changing Point Detection
CN114255434A (en) * 2022-03-01 2022-03-29 深圳金三立视频科技股份有限公司 Multi-target tracking method and device

Also Published As

Publication number Publication date
CN114972417A (en) 2022-08-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant