CN114972417B - Multi-target tracking method for dynamic track quality quantification and feature re-planning


Info

Publication number
CN114972417B
CN114972417B
Authority
CN
China
Prior art keywords
track
current frame
frame
detection
state
Prior art date
Legal status
Active
Application number
CN202210343596.XA
Other languages
Chinese (zh)
Other versions
CN114972417A (en)
Inventor
孔军 (Kong Jun)
张元澍 (Zhang Yuanshu)
蒋敏 (Jiang Min)
Current Assignee
Jiangnan University
Original Assignee
Jiangnan University
Priority date
Filing date
Publication date
Application filed by Jiangnan University
Priority to CN202210343596.XA
Publication of CN114972417A
Application granted
Publication of CN114972417B
Legal status: Active


Classifications

    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments (G: Physics; G06: Computing; G06T: Image data processing or generation)
    • G06T 2207/10016: Video; image sequence
    • G06T 2207/30196: Human being; person
    • G06T 2207/30241: Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-target tracking method based on a dynamic track quantization strategy and feature re-planning. The method adopts a multi-target tracking framework built on the joint-detection-and-tracking paradigm. Most existing algorithms initialize every unmatched detection frame as a new track, terminate any track that stays unmatched beyond a threshold number of frames, and thereby ignore the differences between tracks of different quality when handling track birth and termination. The invention therefore provides a dynamic track quality quantization strategy that explicitly characterizes the quality of each track through a dynamically updated score and applies different update mechanisms according to the matching result. In addition, aiming at the conflict between the detection and tracking subtasks in joint-detection-and-tracking models, the invention designs a channel-enhanced feature re-planning module that drives the two subtasks to learn distinct features, improves the suitability of the features, and provides more accurate detection results for the dynamic track quantization strategy.

Description

Multi-target tracking method for dynamic track quality quantification and feature re-planning
Technical Field
The present invention relates to the field of machine vision, and in particular, to a multi-target tracking method, apparatus, and computer storage medium.
Background
As machine vision is studied extensively in both theory and practice, multi-target tracking has become one of its important branches. Owing to the diversity of real-world environments and the complexity of target behavior, many problems in multi-target tracking remain to be solved. Current methods fall mainly into two paradigms: detection-before-tracking and joint detection and tracking.
Before the joint-detection-and-tracking paradigm emerged, most multi-target tracking studies followed the detection-before-tracking paradigm. These algorithms divide the multi-target tracking task into two separate subtasks, detection and tracking, with mutually independent models. Although the results are respectable, the computational cost is high, the subtasks cannot be optimized jointly, and a balance between accuracy and speed is difficult to obtain. By contrast, the joint-detection-and-tracking paradigm fuses the two subtasks into a unified network, reducing the algorithmic complexity of staged processing while increasing the coupling between functional modules, and achieves higher accuracy and a better balance. Within this paradigm, however, the data-association module receives little attention: mainstream algorithms pursue more accurate detection results and more discriminative appearance embeddings while ignoring the diversity of tracks. Tracks of different quality are processed in the same way, and a track is discarded once its number of lost frames reaches a threshold regardless of its quality, so tracks of too low quality interfere with the matching stage, tracks of high quality cannot take part in more rounds of matching, and the accuracy of multi-target tracking suffers.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is that prior-art methods ignore track diversity and discard tracks of any quality once the lost time reaches a threshold, which degrades tracking accuracy.
In order to solve the technical problems, the invention provides a multi-target tracking method, which comprises the following steps:
the method comprises the steps of obtaining an original frame image of a video, inputting the original frame image into a backbone network, and outputting pedestrian characteristics;
calculating according to the pedestrian characteristics to obtain a current frame detection frame and a corresponding current frame appearance embedded vector;
judging whether the current frame image is a first frame or not, if the current frame image is not the first frame, matching and associating a current frame detection frame and the corresponding current frame appearance embedded vector with a previous frame track and an appearance embedded vector updated along with the previous frame track;
if the matching is successful, calculating the track quantization score of the current frame according to the confidence coefficient of the detection frame of the current frame and the track quantization score of the previous frame, embedding the appearance of the current frame for updating, judging the track state, marking the track of the activation state which is successfully matched as a tracking state, and marking the track state which is temporarily lost and successfully matched as an activation state;
if the matching fails, subtracting a preset constant from the previous frame track quantization score to obtain the current frame track quantization score, wherein the appearance embedding is not updated, marking the current frame detection frame with failed matching as a new inactive state track, resetting the current frame track state according to the current frame track quantization score and a preset threshold value, continuing to match, and discarding the current frame track when the current frame quantization score is smaller than the preset threshold value;
and carrying out the operation on the next frame of image until the video is finished.
Preferably, the calculating the current frame detection frame and the corresponding current frame appearance embedded vector according to the pedestrian feature includes:
inputting the pedestrian characteristics into a channel enhancement characteristic re-planning module which is trained by an Adam algorithm together with the overall model, and adaptively dividing the pedestrian characteristics into detection subtask characteristics and tracking subtask characteristics;
and respectively calculating the current frame detection frame and the corresponding current frame appearance embedded vector according to the detection subtask characteristics and the tracking subtask characteristics.
Preferably, inputting the pedestrian feature into a channel-enhanced feature re-planning module trained jointly with the overall model by the Adam algorithm, and adaptively dividing the pedestrian feature into a detection subtask feature and a tracking subtask feature, comprises:
inputting the pedestrian feature into the channel-enhanced feature re-planning module;
passing the pedestrian feature F_t ∈ R^{H×W×C} through two point-wise convolutions to obtain a first feature F_q ∈ R^{H×W×1} and a second feature F_v ∈ R^{H×W×rC};
applying a softmax function to the first feature F_q and matrix-multiplying the result with the second feature F_v to obtain a pedestrian feature vector V_cha containing global and channel-dimension information;
passing the pedestrian feature vector V_cha through two parallel groups of convolution-normalization-ReLU-channel-shuffle operations to obtain the detection subtask feature vector V_det and the tracking subtask feature vector V_id respectively;
passing the pedestrian feature F_t ∈ R^{H×W×C} through a residual branch of convolution-normalization-ReLU-channel-shuffle operations in which the number of input channels is expanded r times, to obtain a reconstructed pedestrian feature F′;
broadcasting the detection subtask feature vector V_det and the tracking subtask feature vector V_id and multiplying them element-wise with the reconstructed pedestrian feature F′ to obtain the detection subtask feature F_det and the tracking subtask feature F_id required by the respective subtasks.
Preferably, the calculating the current frame detection frame according to the detection subtask feature includes:
the detection subtask includes a heatmap branch, an offset branch, and a size branch;
convolving the detection subtask feature and applying a ReLU activation to obtain a heatmap tensor O_heatmap, an offset tensor O_offset, and a size tensor O_size;
calculating the current frame detection frames D_i, i ∈ [1, …, M] from O_heatmap, O_offset, and O_size, where M is the number of detection frames in the current frame.
Preferably, the calculating the current frame appearance embedded vector according to the tracking subtask feature includes:
the tracking subtask includes an appearance-embedding branch;
convolving the tracking subtask feature and applying a ReLU activation to obtain an appearance-embedding tensor O_id;
extracting the current frame appearance embedded vectors ED_i, i ∈ [1, …, M] from O_id at the positions corresponding to the center points of the current frame detection frames D_i.
Preferably, if the matching is successful, calculating the current frame track quantization score according to the current frame detection frame confidence and the previous frame track quantization score, and updating the appearance embedding, includes:
when the matching is successful, the track quantization score is updated as ST_t = f(ST_{t-1}, SD_t), whose closed form is given only as an image in the original publication;
the appearance embedding is updated as ET_t = (α - δ) × ET_{t-1} + (1 - α + δ) × ED_t,
where ST is the track quantization score, SD is the detection frame confidence, ET and ED are the appearance embeddings of the track and the detection frame respectively, δ (likewise given only as an image) is an influence factor computed from the detection frame confidence, and α is a constant.
Preferably, the marking the current frame detection frame with failed matching as a new inactive state track, resetting the current frame track state according to the current frame track quantization score and a preset threshold value, continuing matching, and discarding the current frame track when the current frame quantization score is smaller than the preset threshold value comprises:
marking the current frame detection frame with failed matching as a new inactive state track;
when the quantization score of the inactive state track is greater than or equal to a first preset threshold value, the inactive state track is activated to be an active state track;
when the matching of the active state track in a plurality of frames fails, but the quantization score is not lower than a second preset threshold value, the state is changed into a temporary lost state to continue the matching, and if the temporary lost state track is successfully matched again, the state is changed into the active state from the temporary lost state;
and when the quantization score of the temporarily lost track is smaller than the second preset threshold value, the target is considered to have disappeared from the video sequence, and the track is changed to a discarded state, no longer participating in subsequent matching.
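For illustration, the track life cycle just described can be sketched as a small state machine in Python; the state and function names are illustrative, and the thresholds θ1 = 0.5 and θ2 = 0.15 are taken from the embodiment described later:

```python
from enum import Enum, auto

class TrackState(Enum):
    INACTIVE = auto()   # newly created from an unmatched detection frame
    ACTIVE = auto()     # confirmed track taking part in matching
    LOST = auto()       # temporarily lost but still matched against
    DISCARDED = auto()  # removed from all subsequent matching

THETA1, THETA2 = 0.5, 0.15  # first/second preset thresholds from the embodiment

def next_state(state: TrackState, score: float, matched: bool) -> TrackState:
    """One transition of the track life cycle driven by the quantization score."""
    if state is TrackState.INACTIVE and score >= THETA1:
        return TrackState.ACTIVE            # quality high enough: activate
    if state is TrackState.ACTIVE and not matched:
        return TrackState.LOST              # keep matching while temporarily lost
    if state is TrackState.LOST and matched:
        return TrackState.ACTIVE            # re-matched: back to active
    if state is TrackState.LOST and score < THETA2:
        return TrackState.DISCARDED         # target assumed gone from the sequence
    return state
```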
The invention also provides a device for multi-target tracking, which comprises:
the feature extraction module is used for acquiring an original frame image of the video, inputting the original frame image into a backbone network and outputting pedestrian features;
the target detection and appearance extraction module is used for calculating and obtaining a current frame detection frame and a corresponding current frame appearance embedded vector according to the pedestrian characteristics;
the matching association module is used for judging whether the current frame image is a first frame or not, and if the current frame image is not the first frame, matching and associating a current frame detection frame and the corresponding current frame appearance embedded vector with a previous frame track and an appearance embedded vector updated along with the previous frame track;
the track updating module is used for calculating the track quantization score of the current frame according to the confidence coefficient of the current frame detection frame and the track quantization score of the previous frame if the matching is successful, embedding the appearance of the current frame for updating, judging the track state, marking the track of the activation state which is successfully matched as a tracking state, and marking the track state which is temporarily lost and successfully matched as an activation state;
the matching failure track updating module is used for subtracting a preset constant from the previous frame track quantization score to obtain the current frame track quantization score if the matching fails, embedding the current frame appearance into the current frame without updating, marking the current frame detection frame which fails to match as a new inactive state track, resetting the current frame track state according to the current frame track quantization score and a preset threshold value, continuing to match, and discarding the current frame track when the current frame quantization score is smaller than the preset threshold value;
and the ending judgment module is used for carrying out the operation on the next frame of image until the video ends.
The present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of a multi-target tracking method as described above.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the invention evaluates the quality of each track, adopts the quantization score to explicitly represent the quality of each track, dynamically updates the quantization score in each frame according to the difference of the matching results, and then determines the state of the track according to the updated track quantization score. The invention considers the diversity of the track, introduces the quality of the track and the detection result into the updating of the track, prolongs the duration of the high-quality track, increases the possibility of matching with the subsequent detection frame, and compared with the track with low quality, the track with low quality can be terminated in a shorter time, thereby reducing the interference to the matching link. Through the strategy, identity switching in matching is effectively reduced, so that tracking accuracy and target identity robustness are improved.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings, in which:
FIG. 1 is a flow chart of an implementation of the multi-target tracking method of the present invention;
FIG. 2 is a schematic diagram of the channel-enhanced feature re-planning (CEFR) module;
FIG. 3 is a diagram of an algorithm model of the present invention;
FIG. 4 is a flow chart of one embodiment of the present invention;
fig. 5 is a block diagram of a multi-target tracking apparatus according to an embodiment of the present invention.
Detailed Description
The core of the invention is to provide a method, a device and a computer storage medium for scoring tracks, prolonging the duration of high-quality tracks and improving the detection precision.
In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a flowchart of an implementation of a multi-target tracking method provided by the present invention; the specific operation steps are as follows:
s101, acquiring an original frame image of a video, inputting the original frame image into a backbone network, and outputting pedestrian characteristics;
acquisition of RGB frame I t : frame taking processing is carried out on the video to obtain an original RGB frame I t Wherein t is [1, …, N];
Extracting the features of the RGB frame: will original RGB frame I t Input to backbone network, output feature F t
The backbone network adopts DLA-34, and other convolutional neural networks can also be adopted.
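A minimal sketch of this step, assuming the DLA-34 implementation shipped with the timm library; the model name "dla34", the features_only interface, and the input size are assumptions of this sketch, not prescribed by the patent:

```python
import timm
import torch

# Hypothetical setup: timm's DLA-34 as the backbone, returning feature maps.
backbone = timm.create_model("dla34", pretrained=True, features_only=True)
backbone.eval()

frame = torch.randn(1, 3, 608, 1088)   # one RGB frame I_t, resized as in the experiments
with torch.no_grad():
    feature_maps = backbone(frame)     # list of multi-scale feature maps
f_t = feature_maps[-1]                 # deepest map stands in for the pedestrian feature F_t
```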
S102, calculating to obtain a current frame detection frame and a corresponding current frame appearance embedded vector according to the pedestrian characteristics;
S103, judging whether the current frame image is the first frame; if it is not the first frame, i.e. t ≠ 1, matching and associating the current frame detection frames D_i, i ∈ [1, M] and the corresponding current frame appearance embedded vectors ED_i, i ∈ [1, M] with the previous frame tracks T_{t-1,j}, j ∈ [1, K] and the appearance embedded vectors ET_{t-1,j}, j ∈ [1, K] updated along with them;
where M is the number of current frame detection frames and K is the number of previous frame tracks; matched detection frames are used to update the corresponding tracks.
If the current frame image is the first frame, track initialization is performed: the current frame track quantization score is set equal to the confidence of the current frame detection frame, ST_t = SD_t, and the current frame appearance embedding is set to the appearance embedding corresponding to the current frame detection frame, ET_t = ED_t.
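The patent does not fix the assignment algorithm for this matching step; the sketch below uses a common choice in joint-detection-and-tracking methods, Hungarian matching over cosine distances between track embeddings ET and detection embeddings ED, with max_dist as an assumed gating parameter:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_embs: np.ndarray, det_embs: np.ndarray, max_dist: float = 0.4):
    """Match K track embeddings (ET) to M detection embeddings (ED).

    Hungarian assignment on cosine distance; pairs above max_dist stay unmatched.
    """
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    cost = 1.0 - t @ d.T                                  # K x M cosine distances
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    unmatched_tracks = sorted(set(range(len(track_embs))) - {r for r, _ in matches})
    unmatched_dets = sorted(set(range(len(det_embs))) - {c for _, c in matches})
    return matches, unmatched_tracks, unmatched_dets
```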
S104, if the matching is successful, calculating the track quantization score of the current frame according to the confidence coefficient of the detection frame of the current frame and the track quantization score of the previous frame, embedding the appearance of the current frame for updating, judging the track state, marking the track of the activation state which is successfully matched as a tracking state, and marking the track state which is temporarily lost and successfully matched as an activation state;
When the matching is successful, the track quantization score is updated as ST_t = f(ST_{t-1}, SD_t), whose closed form is given only as an image in the original publication;
the appearance embedding is updated as ET_t = (α - δ) × ET_{t-1} + (1 - α + δ) × ED_t,
where ST is the track quantization score, SD is the detection frame confidence, and ET and ED are the appearance embeddings of the track and the detection frame respectively; δ (likewise given only as an image), an influence factor computed from the detection frame confidence, gives high-quality detection appearances a higher embedding weight, and α is a constant.
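As a worked illustration of this confidence-weighted moving average: in the sketch below δ is passed in as a value, since its closed form appears only as an image in the original, and α = 0.9 is an assumed illustrative constant:

```python
import numpy as np

def update_appearance(et_prev: np.ndarray, ed_cur: np.ndarray,
                      alpha: float, delta: float) -> np.ndarray:
    """ET_t = (alpha - delta) * ET_{t-1} + (1 - alpha + delta) * ED_t."""
    return (alpha - delta) * et_prev + (1.0 - alpha + delta) * ed_cur

# With alpha = 0.9, a high-confidence detection yielding delta = 0.1 doubles the
# weight on the current appearance ED_t (0.2 instead of 0.1), i.e. exactly the
# "higher embedding weight for high-quality detections" behaviour described above.
```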
S105, if the matching fails, the previous frame track quantization score is reduced by a preset constant to obtain the current frame track quantization score, ST_t = ST_{t-1} - C_lost; the appearance embedding is not updated. The current frame detection frames that failed to match are marked as new inactive tracks, the current frame track states are reset according to the current frame track quantization scores and the preset thresholds to continue matching, and the current frame track is discarded when its quantization score is smaller than the preset threshold;
C_lost is a constant; in this embodiment C_lost = 0.03;
the current frame detection frame that failed to match is marked as a new inactive track;
when the quantization score of an inactive track is greater than or equal to a first preset threshold θ1, the track is activated to an active track; in this embodiment θ1 = 0.5;
when an active track fails to match over several frames but its quantization score is not lower than a second preset threshold θ2, its state is changed to temporarily lost and it continues to participate in matching; if a temporarily lost track is matched again, its state changes from temporarily lost back to active; in this embodiment θ2 = 0.15;
when the quantization score of a temporarily lost track is smaller than the second preset threshold, the target is considered to have disappeared from the video sequence and the track is changed to a discarded state, no longer participating in subsequent matching.
And S106, carrying out the operation on the next frame of image until the video is finished.
If t = N, the process terminates; otherwise t = t + 1 and the next frame is processed from step S101.
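Putting steps S103-S105 together, one frame of the dynamic track quantization strategy can be sketched as follows. Here next_state, update_appearance, and TrackState refer to the earlier sketches; update_score stands in for the matched-case ST update (given only as an image in the original), and delta, new_track, and the track/detection field names are hypothetical placeholders:

```python
C_LOST = 0.03   # per-frame score penalty C_lost used in this embodiment
ALPHA = 0.9     # assumed constant for the appearance update

def process_frame(tracks, detections, matches, unmatched_tracks, unmatched_dets):
    """One iteration of steps S103-S105 (sketch)."""
    for ti, di in matches:                      # successful matches
        trk, det = tracks[ti], detections[di]
        trk.score = update_score(trk.score, det.confidence)   # ST_t from ST_{t-1}, SD_t
        trk.emb = update_appearance(trk.emb, det.emb, ALPHA, delta(det.confidence))
        trk.state = next_state(trk.state, trk.score, matched=True)
    for ti in unmatched_tracks:                 # failed matches: penalize, ED not updated
        trk = tracks[ti]
        trk.score -= C_LOST                     # ST_t = ST_{t-1} - C_lost
        trk.state = next_state(trk.state, trk.score, matched=False)
    for di in unmatched_dets:                   # leftover detections start inactive tracks
        tracks.append(new_track(detections[di], state=TrackState.INACTIVE))
    tracks[:] = [t for t in tracks if t.state is not TrackState.DISCARDED]
```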
The data-association module receives little attention in multi-target tracking: current mainstream algorithms pursue more accurate detection results and more discriminative appearance embeddings but ignore the diversity of tracks, processing tracks of different quality in the same way, so that low-quality tracks interfere with the matching stage and high-quality tracks cannot take part in more rounds of matching.
Therefore, the invention designs a dynamic track quantization strategy, and adopts a quantization score to explicitly represent the quality of the track. Different mechanisms are dynamically employed to update the quantization score, state, and appearance embedding of the track based on the results of each frame match.
The dynamic track quantization strategy provided by the invention considers the diversity of tracks and introduces track quality and detection results into the track update, prolonging the lifetime of high-quality tracks and increasing their chances of matching subsequent detection frames; low-quality tracks, by contrast, are terminated in a shorter time, reducing interference with the matching stage. Through this strategy, identity switches during matching are effectively reduced, improving tracking accuracy and the robustness of target identities.
As shown in fig. 2, based on the above embodiment, the present embodiment further describes step S102 in detail, specifically as follows:
S201, inputting the pedestrian feature into a channel-enhanced feature re-planning module trained jointly with the overall model by the Adam algorithm, which adaptively divides the pedestrian feature into a detection subtask feature and a tracking subtask feature;
the pedestrian feature is input into the channel-enhanced feature re-planning module;
the pedestrian feature F_t ∈ R^{H×W×C} passes through two point-wise convolutions to obtain a first feature F_q ∈ R^{H×W×1} and a second feature F_v ∈ R^{H×W×rC};
a softmax function is applied to the first feature F_q, and the result is matrix-multiplied with the second feature F_v to obtain a pedestrian feature vector V_cha containing global and channel-dimension information; here r = 2 in the dimension of F_v;
the pedestrian feature vector V_cha passes through two parallel groups of convolution-normalization-ReLU-channel-shuffle operations to obtain the detection subtask feature vector V_det and the tracking subtask feature vector V_id respectively;
the pedestrian feature F_t ∈ R^{H×W×C} passes through a residual branch of convolution-normalization-ReLU-channel-shuffle operations in which the number of input channels is expanded r times (i.e., to rC, where C is the original channel count), giving the reconstructed pedestrian feature F′;
the detection subtask feature vector V_det and the tracking subtask feature vector V_id are broadcast and multiplied element-wise with the reconstructed pedestrian feature F′ to obtain the detection subtask feature F_det and the tracking subtask feature F_id required by the respective subtasks.
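A PyTorch sketch of the module as described above follows; kernel sizes, the use of BatchNorm, the two-group channel shuffle, and the exact placement of the residual connection are assumptions where the text leaves them open:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Interleave channel groups to strengthen inter-channel interaction."""
    b, c = x.shape[:2]
    return x.view(b, groups, c // groups, *x.shape[2:]).transpose(1, 2).reshape_as(x)

class CEFR(nn.Module):
    """Channel-enhanced feature re-planning (sketch), with r = 2 as in the embodiment."""

    def __init__(self, c: int, r: int = 2):
        super().__init__()
        self.to_q = nn.Conv2d(c, 1, 1)           # point-wise conv -> F_q (H x W x 1)
        self.to_v = nn.Conv2d(c, r * c, 1)       # point-wise conv -> F_v (H x W x rC)
        def branch():                            # conv - norm - ReLU (shuffle applied after)
            return nn.Sequential(nn.Conv1d(r * c, r * c, 1),
                                 nn.BatchNorm1d(r * c), nn.ReLU())
        self.det_branch, self.id_branch = branch(), branch()
        self.expand = nn.Conv2d(c, r * c, 1)     # channel expansion C -> rC
        self.res_block = nn.Sequential(nn.Conv2d(r * c, r * c, 3, padding=1),
                                       nn.BatchNorm2d(r * c), nn.ReLU())

    def forward(self, f_t: torch.Tensor):
        b, c, h, w = f_t.shape
        q = F.softmax(self.to_q(f_t).reshape(b, 1, h * w), dim=-1)  # attention over positions
        v = self.to_v(f_t).reshape(b, -1, h * w)                    # b x rC x HW
        v_cha = torch.bmm(v, q.transpose(1, 2))                     # b x rC x 1 global channel vector
        v_det = channel_shuffle(self.det_branch(v_cha))             # re-planning weights (detection)
        v_id = channel_shuffle(self.id_branch(v_cha))               # re-planning weights (tracking)
        f_e = self.expand(f_t)
        f_prime = f_e + channel_shuffle(self.res_block(f_e))        # reconstructed feature F'
        f_det = f_prime * v_det.unsqueeze(-1)                       # broadcast element-wise multiply
        f_id = f_prime * v_id.unsqueeze(-1)
        return f_det, f_id

# Example usage: f_det, f_id = CEFR(c=64)(torch.randn(1, 64, 152, 272))
# returns two rC-channel maps with unchanged spatial size, as the text requires.
```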
S202, respectively calculating the current frame detection frame and the corresponding current frame appearance embedded vector according to the detection subtask features and the tracking subtask features.
The detection subtask includes a heatmap branch, an offset branch, and a size branch;
the detection subtask feature is convolved and passed through a ReLU activation to obtain the heatmap tensor O_heatmap, the offset tensor O_offset, and the size tensor O_size;
the current frame detection frames D_i, i ∈ [1, …, M] are calculated from O_heatmap, O_offset, and O_size, where M is the number of detection frames in the current frame;
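The step from the three output tensors to boxes is not spelled out beyond the branch definitions; a standard CenterNet-style decode, shown below as one plausible realization (the top-k and threshold values are assumptions), recovers boxes from heatmap peaks plus the offset and size maps:

```python
import torch
import torch.nn.functional as F

def decode_detections(heatmap, offset, size, k=100, thresh=0.4):
    """Decode boxes D_i from O_heatmap (1x1xHxW) and O_offset / O_size (1x2xHxW)."""
    h, w = heatmap.shape[2:]
    # 3x3 max-pool NMS: keep only local maxima of the center heatmap
    peaks = heatmap * (heatmap == F.max_pool2d(heatmap, 3, stride=1, padding=1))
    scores, idx = peaks.view(-1).topk(k)
    keep = scores > thresh
    scores, idx = scores[keep], idx[keep]
    ys = torch.div(idx, w, rounding_mode="floor")
    xs = idx % w
    cx = xs.float() + offset[0, 0, ys, xs]        # sub-pixel center x
    cy = ys.float() + offset[0, 1, ys, xs]        # sub-pixel center y
    bw, bh = size[0, 0, ys, xs], size[0, 1, ys, xs]
    boxes = torch.stack([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], dim=1)
    return boxes, scores, (xs, ys)                # centers reused to read out ED_i
```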
The tracking subtask includes an appearance-embedding branch;
the tracking subtask feature is convolved and passed through a ReLU activation to obtain the appearance-embedding tensor O_id;
the current frame appearance embedded vectors ED_i, i ∈ [1, …, M] are extracted from O_id at the positions corresponding to the center points of the current frame detection frames D_i.
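Reading the embeddings out of O_id at the detected centers is then a simple gather; the L2 normalization below is an assumption made so the vectors are directly usable for cosine matching:

```python
import torch.nn.functional as F

def extract_embeddings(o_id, centers):
    """Gather one appearance vector ED_i per detection from O_id (1 x E x H x W)."""
    xs, ys = centers
    emb = o_id[0, :, ys, xs].T          # M x E, one row per detection frame D_i
    return F.normalize(emb, dim=1)      # unit norm (assumed) for cosine matching
```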
The channel-enhanced feature re-planning (CEFR) module provided by the invention re-plans the features output by the backbone network into two task-specific features. During model training, the detection subtask features become more sensitive to position information and the tracking subtask features more sensitive to identity information, which reduces interference between the two subtasks, relieves their optimization conflict, improves the suitability of the features, and yields more accurate detection results and more discriminative appearance embeddings. In addition, the module establishes globally dense connections along the channel dimension, enriching multi-scale channel information and making up for the lack of long-range dependence, and performs a channel-shuffle operation to strengthen interaction between channels. The module thus alleviates, to a certain extent, the optimization conflict between the subtasks of the joint-detection-and-tracking paradigm, and supplements the subtask features with inter-channel information and long-range dependencies. Its input and output sizes are unchanged, so it can be used in other algorithms of the same paradigm.
Based on the above examples, in order to verify the accuracy and robustness of the present invention, experiments were performed on the disclosed MOT17 and MOT20 data sets, specifically as follows:
the MOT17 data set comprises 14 video sequences and 1342 tracks, wherein interference factors such as different camera angles, different weather conditions, different camera movements and the like exist, and the crowd density distribution is balanced. The detection result of MOT17 is obtained by detecting three different detectors of DPM, SDP and FasterR-CNN.
MOT20 is a newer dataset containing 8 video sequences, about 13,400 frames in total. Its number of tracks is similar to MOT17, but its crowd density is almost three times higher, making it a dense-scene benchmark that is more challenging for the algorithm. Interference factors such as different camera angles are likewise present.
The experiments comprise two parts: online testing on the test set and offline validation on a held-out split. Online testing:
Experimental settings: training on MOT17 ran for 20 epochs, input images were resized to 1088×608, and the learning rate started at 0.0001 and dropped to 0.00001 for the last 10 epochs. Of the 14 video sequences, half were used for training and half for testing. Training on MOT20 ran for 15 epochs, with a learning rate of 0.0001 for the first 10 epochs and 0.00001 for the last 5; 4 video sequences were used for training and 4 for testing. The backbone network is DLA-34.
Table 1. MOTA test results on MOT17 and MOT20

Method  | MOT17 | MOT20
FairMOT | 73.7% | 61.8%
Ours    | 75.2% | 63.9%
As Table 1 shows, the method of the present invention achieves higher MOTA on both the MOT17 and MOT20 datasets. Although both datasets involve occlusion, cluttered backgrounds, and viewpoint changes, and the crowd density of MOT20 is high, the proposed method is robust to these difficulties and therefore achieves a clear improvement.
Offline verification:
setting experimental parameters: for 7 video sequences for training of MOT17, the first half of the frame of each sequence was taken as the training set for the validation experiment and the second half as the validation set for the validation experiment. The training round is 20 on the newly divided training set, the input picture size is adjusted to 1088×608, the learning rate is 0.0001 at the beginning, and the last 10 rounds are reduced to 0.00001. The backbone network employs DLA-34.
The proposed method comprises two parts: the dynamic track quantization strategy and the channel-enhanced feature re-planning (CEFR) module. As Table 2 shows, the baseline network FairMOT reaches a MOTA of 71.1% on the MOT17 validation set; adding only the dynamic track quantization strategy raises MOTA to 71.5%, adding only the CEFR module raises it to 72.7%, and adding both to FairMOT yields a final MOTA of 73.4%. This shows that both mechanisms benefit multi-target tracking performance: the dynamic track quantization strategy treats tracks of different quality more fairly, the CEFR module relieves the optimization conflict between subtasks, and both effectively improve tracking accuracy.
Table 2. Ablation results on the MOT17 validation set (MOTA)

Method                                               | MOT17 validation set
FairMOT                                              | 71.1%
FairMOT + CEFR module                                | 72.7%
FairMOT + dynamic track quantization strategy        | 71.5%
FairMOT + dynamic track quantization strategy + CEFR | 73.4%
As shown in FIGS. 3 and 4, the algorithm improves on FairMOT. With RGB frames as input, the model comprises four key parts: (1) feature extraction; (2) the channel-enhanced feature re-planning (CEFR) module; (3) the two subtask branches for detection and tracking; and (4) the data-association module. The detection subtask has three branches (heatmap, offset, and detection frame size), while the tracking subtask has a single appearance-embedding branch. The invention designs a dynamic track quantization strategy that explicitly represents track quality, highlights the differences between tracks of different quality, treats them more fairly, and reduces the interference of low-quality tracks with the matching stage. The invention also constructs a channel-enhanced feature re-planning module that focuses on channel-dimension information and inter-channel interaction, provides the network with long-range dependence, and relieves the optimization conflict between subtasks to a certain extent. Compared with existing multi-target tracking methods, the method achieves higher tracking accuracy and maintains target identities more robustly.
In summary, the invention discloses a multi-target tracking method based on a dynamic track quantization strategy and feature re-planning, built on a multi-target tracking framework of the joint-detection-and-tracking paradigm. Because most algorithms initialize every unmatched detection frame as a new track, terminate any track unmatched beyond a threshold number of frames, and ignore the differences between tracks of different quality when handling track birth and termination, the invention provides a dynamic track quality quantization strategy that explicitly characterizes the quality of each track with a dynamically updated score and applies different update mechanisms according to the matching result. In addition, aiming at the conflict between the detection and tracking subtasks in joint-detection-and-tracking models, the invention designs a channel-enhanced feature re-planning module that drives the two subtasks to learn distinct features, improves the suitability of the features, and provides more accurate detection results for the dynamic track quantization strategy.
Referring to fig. 5, fig. 5 is a block diagram of a multi-target tracking device according to an embodiment of the present invention; the specific apparatus may include:
the feature extraction module 100 is used for acquiring an original frame image of a video, inputting the original frame image into a backbone network, and outputting pedestrian features;
the target detection and appearance extraction module 200 is configured to calculate a current frame detection frame and a corresponding current frame appearance embedded vector according to the pedestrian characteristics;
the matching association module 300 is configured to determine whether the current frame image is a first frame, and if the current frame image is not the first frame, match and associate a current frame detection frame and a corresponding appearance embedded vector of the current frame with a previous frame track and an appearance embedded vector updated along with the previous frame track;
the successful matching track updating module 400 is configured to calculate a current frame track quantization score according to a current frame detection frame confidence and a previous frame track quantization score if the matching is successful, embed the appearance of the current frame for updating, judge a track state, mark a successful matching activation state track as a tracking state, and mark a successful matching temporary loss track state as an activation state;
the matching failure track updating module 500 is configured to subtract a preset constant from the previous frame track quantization score to obtain the current frame track quantization score if the matching fails, leave the current frame appearance embedding not updated, mark the current frame detection frame that failed to match as a new inactive state track, reset the current frame track state according to the current frame track quantization score and a preset threshold to continue matching, and discard the current frame track when the current frame quantization score is smaller than the preset threshold;
and the ending judgment module 600 is used for carrying out the operation on the next frame of image until the video ends.
The multi-target tracking apparatus of this embodiment is used to implement the multi-target tracking method described above, so the specific implementation of each module can be found in the corresponding method steps. For example, the feature extraction module 100, the target detection and appearance extraction module 200, the matching association module 300, the matching-success track updating module 400, the matching-failure track updating module 500, and the ending judgment module 600 implement steps S101, S102, S103, S104, S105, and S106 of the method respectively; their details therefore refer to the descriptions of those steps and are not repeated here.
The specific embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the steps of the multi-target tracking method when being executed by a processor.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is apparent that the above examples are given by way of illustration only and do not limit the embodiments. Other variations and modifications will be apparent to those of ordinary skill in the art in light of the foregoing description; it is neither necessary nor possible to enumerate all embodiments here. Any obvious variations or modifications derived from them remain within the scope of the invention.

Claims (8)

1. A multi-target tracking method, comprising:
the method comprises the steps of obtaining an original frame image of a video, inputting the original frame image into a backbone network, and outputting pedestrian characteristics;
calculating according to the pedestrian characteristics to obtain a current frame detection frame and a corresponding current frame appearance embedded vector;
judging whether the current frame image is a first frame or not, if the current frame image is not the first frame, matching and associating a current frame detection frame and the corresponding current frame appearance embedded vector with a previous frame track and an appearance embedded vector updated along with the previous frame track;
if the matching is successful, calculating the track quantization score of the current frame according to the confidence coefficient of the detection frame of the current frame and the track quantization score of the previous frame, embedding the appearance of the current frame for updating, judging the track state, marking the track of the activation state which is successfully matched as a tracking state, and marking the track state which is temporarily lost and successfully matched as an activation state;
if the matching fails, subtracting a preset constant from the previous frame track quantization score to obtain the current frame track quantization score, wherein the appearance embedding is not updated, marking the current frame detection frame with failed matching as a new inactive state track, resetting the current frame track state according to the current frame track quantization score and a preset threshold value, continuing to match, and discarding the current frame track when the current frame quantization score is smaller than the preset threshold value;
marking the current frame detection frame with failed matching as a new inactive state track; when the quantization score of the inactive state track is greater than or equal to a first preset threshold value, the inactive state track is activated to an active state track; when the matching of the active state track fails over a plurality of frames but its quantization score is not lower than a second preset threshold value, its state is changed to a temporarily lost state and matching continues, and if the temporarily lost track is successfully matched again, its state changes from temporarily lost back to active; when the quantization score of the temporarily lost track is smaller than the second preset threshold value, the target is considered to have disappeared from the video sequence and the track is changed to a discarded state, no longer participating in subsequent matching;
performing the above operation on the next frame of image until the video is finished;
the step of calculating the current frame track quantization score according to the current frame detection frame confidence and the previous frame track quantization score, and updating the appearance embedding, comprises:
when the matching is successful, the track quantization score is updated as ST_t = f(ST_{t-1}, SD_t), whose closed form is given only as an image in the original publication;
the appearance embedding is updated as ET_t = (α - δ) × ET_{t-1} + (1 - α + δ) × ED_t,
where ST is the track quantization score, SD is the detection frame confidence, ET and ED are the appearance embeddings of the track and the detection frame respectively, δ (likewise given only as an image) is an influence factor computed from the detection frame confidence, and α is a constant.
2. The multi-target tracking method of claim 1, wherein the calculating a current frame detection frame and a corresponding current frame appearance embedding vector from the pedestrian feature comprises:
inputting the pedestrian characteristics into a channel enhancement characteristic re-planning module which is trained by an Adam algorithm together with the overall model, and adaptively dividing the pedestrian characteristics into detection subtask characteristics and tracking subtask characteristics;
and respectively calculating the current frame detection frame and the corresponding current frame appearance embedded vector according to the detection subtask characteristics and the tracking subtask characteristics.
3. The multi-target tracking method according to claim 2, wherein the step of inputting the pedestrian feature into a channel-enhanced feature re-planning module trained jointly with the overall model by the Adam algorithm, and adaptively dividing the pedestrian feature into a detection subtask feature and a tracking subtask feature, comprises:
inputting the pedestrian feature into the channel-enhanced feature re-planning module;
passing the pedestrian feature F_t ∈ R^{H×W×C} through two point-wise convolutions to obtain a first feature F_q ∈ R^{H×W×1} and a second feature F_v ∈ R^{H×W×rC};
applying a softmax function to the first feature F_q and matrix-multiplying the result with the second feature F_v to obtain a pedestrian feature vector V_cha containing global and channel-dimension information;
passing the pedestrian feature vector V_cha through two parallel groups of convolution-normalization-ReLU-channel-shuffle operations to obtain the detection subtask feature vector V_det and the tracking subtask feature vector V_id respectively;
passing the pedestrian feature F_t ∈ R^{H×W×C} through a residual branch of convolution-normalization-ReLU-channel-shuffle operations in which the number of input channels is expanded r times, to obtain a reconstructed pedestrian feature F′;
broadcasting the detection subtask feature vector V_det and the tracking subtask feature vector V_id and multiplying them element-wise with the reconstructed pedestrian feature F′ to obtain the detection subtask feature F_det and the tracking subtask feature F_id required by the respective subtasks.
4. The multi-target tracking method of claim 2, wherein said calculating said current frame detection frame from said detection subtask feature comprises:
the detection subtask includes a heatmap branch, an offset branch, and a size branch;
convolving the detection subtask feature and applying a ReLU activation to obtain a heatmap tensor O_heatmap, an offset tensor O_offset, and a size tensor O_size;
calculating the current frame detection frames D_i, i ∈ [1, …, M] from O_heatmap, O_offset, and O_size, where M is the number of detection frames in the current frame.
5. The multi-target tracking method of claim 4, wherein said computing the current frame appearance embedding vector from the tracking subtask feature comprises:
the tracking subtask includes an appearance-embedding branch;
convolving the tracking subtask feature and applying a ReLU activation to obtain an appearance-embedding tensor O_id;
extracting the current frame appearance embedded vectors ED_i, i ∈ [1, …, M] from O_id at the positions corresponding to the center points of the current frame detection frames D_i.
6. The multi-target tracking method according to claim 1, wherein the determining whether the current frame image is the first frame comprises:
if the current frame image is the first frame, track initialization is performed: the current frame track quantization score is set equal to the confidence of the current frame detection frame, ST_t = SD_t, and the current frame appearance embedding is set to the appearance embedding corresponding to the current frame detection frame, ET_t = ED_t.
7. A multi-target tracking apparatus, comprising:
the feature extraction module is used for acquiring an original frame image of the video, inputting the original frame image into a backbone network and outputting pedestrian features;
the target detection and appearance extraction module is used for calculating and obtaining a current frame detection frame and a corresponding current frame appearance embedded vector according to the pedestrian characteristics;
the matching association module is used for judging whether the current frame image is a first frame or not, and if the current frame image is not the first frame, matching and associating a current frame detection frame and the corresponding current frame appearance embedded vector with a previous frame track and an appearance embedded vector updated along with the previous frame track;
the track updating module is used for calculating the current frame track quantization score according to the current frame detection frame confidence and the previous frame track quantization score if the matching is successful, updating the appearance embedding of the current frame, judging the track state, marking a successfully matched active state track as a tracking state, and marking a successfully matched temporarily lost track as an active state; the step of calculating the current frame track quantization score according to the current frame detection frame confidence and the previous frame track quantization score, and updating the appearance embedding, comprises:
when the matching is successful, the track quantization score is updated as ST_t = f(ST_{t-1}, SD_t), whose closed form is given only as an image in the original publication;
the appearance embedding is updated as ET_t = (α - δ) × ET_{t-1} + (1 - α + δ) × ED_t,
where ST is the track quantization score, SD is the detection frame confidence, ET and ED are the appearance embeddings of the track and the detection frame respectively, δ (likewise given only as an image) is an influence factor computed from the detection frame confidence, and α is a constant;
the matching failure track updating module is used for subtracting a preset constant from the previous frame track quantization score to obtain the current frame track quantization score if the matching fails, with the current frame appearance embedding not updated, marking the current frame detection frame that failed to match as a new inactive state track, resetting the current frame track state according to the current frame track quantization score and a preset threshold value to continue matching, and discarding the current frame track when the current frame quantization score is smaller than the preset threshold value; when the quantization score of the inactive state track is greater than or equal to a first preset threshold value, the inactive state track is activated to an active state track; when the matching of the active state track fails over a plurality of frames but its quantization score is not lower than a second preset threshold value, its state is changed to a temporarily lost state and matching continues, and if the temporarily lost track is successfully matched again, its state changes from temporarily lost back to active; when the quantization score of the temporarily lost track is smaller than the second preset threshold value, the target is considered to have disappeared from the video sequence and the track is changed to a discarded state, no longer participating in subsequent matching;
and the ending judgment module is used for carrying out the operation on the next frame of image until the video ends.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of a multi-object tracking method according to any of claims 1 to 6.
CN202210343596.XA 2022-04-02 2022-04-02 Multi-target tracking method for dynamic track quality quantification and feature re-planning Active CN114972417B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210343596.XA CN114972417B (en) 2022-04-02 2022-04-02 Multi-target tracking method for dynamic track quality quantification and feature re-planning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210343596.XA CN114972417B (en) 2022-04-02 2022-04-02 Multi-target tracking method for dynamic track quality quantification and feature re-planning

Publications (2)

Publication Number Publication Date
CN114972417A (en) 2022-08-30
CN114972417B (en) 2023-06-30

Family

ID=82976823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210343596.XA Active CN114972417B (en) 2022-04-02 2022-04-02 Multi-target tracking method for dynamic track quality quantification and feature re-planning

Country Status (1)

Country Link
CN (1) CN114972417B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190023389A (en) * 2017-08-29 2019-03-08 인하대학교 산학협력단 Multi-Class Multi-Object Tracking Method using Changing Point Detection
CN114255434A (en) * 2022-03-01 2022-03-29 深圳金三立视频科技股份有限公司 Multi-target tracking method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019006632A1 (en) * 2017-07-04 2019-01-10 深圳大学 Video multi-target tracking method and device
CN111709974B (en) * 2020-06-22 2022-08-02 苏宁云计算有限公司 Human body tracking method and device based on RGB-D image
CN113592902A (en) * 2021-06-21 2021-11-02 北京迈格威科技有限公司 Target tracking method and device, computer equipment and storage medium
CN114241007B (en) * 2021-12-20 2022-08-05 江南大学 Multi-target tracking method based on cross-task mutual learning, terminal equipment and medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190023389A (en) * 2017-08-29 2019-03-08 인하대학교 산학협력단 Multi-Class Multi-Object Tracking Method using Changing Point Detection
CN114255434A (en) * 2022-03-01 2022-03-29 深圳金三立视频科技股份有限公司 Multi-target tracking method and device

Also Published As

Publication number Publication date
CN114972417A (en) 2022-08-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant