Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides an online multi-target tracking method based on fine-grained appearance representation. The method attends to different local details of each target, so that the generated fine-grained global representation and local representations complement each other and jointly describe the target's appearance. When a target is occluded, the invention can still recognize it from its visible parts, similar to the strategy humans use when tracking a target. In addition, a sampling strategy that balances positive and negative samples is adopted in the training stage, which stabilizes model convergence and yields better performance.
According to a first aspect of the present invention, there is provided an online multi-target tracking method based on fine-grained appearance representation, comprising: training to obtain an online multi-target tracking model; inputting a video frame to be tested into the trained online multi-target tracking model, and performing inference to realize online multi-target tracking;
the input of the online multi-target tracking model in the training stage is video frames sampled by out-of-order grouping, and the input of the online multi-target tracking model in the inference stage is a single video frame to be tested;
after the video frame is input into the online multi-target tracking model, the processing procedure of the online multi-target tracking model comprises the following steps:
detecting targets in the video frame by using a YOLOX detector, and outputting feature maps with different resolutions; applying an ROIAlign operation to the feature maps to obtain multi-scale shallow target feature maps, learning semantic flows among the shallow target feature maps of different scales, aligning the feature maps according to the semantic flows, and outputting fusion features after aggregating the feature context information; the fusion features comprise a mask feature map and a characterization feature map; taking the mask feature map as input, attending to different parts of a target with a multi-branch parallel target part mask generator to generate a target local mask for each part, and combining the target local masks into a target global mask; generating target fine-grained embeddings by weighted pooling over the characterization feature map with the target global mask and the target local masks; the target fine-grained embeddings comprise: a global embedding and a plurality of local embeddings;
in the training stage, the online multi-target tracking model calculates losses from the global embedding and local embeddings of the targets; in the inference stage, the online multi-target tracking model concatenates the global embedding and local embeddings of each target to serve as the target identity embedding in the association algorithm.
On the basis of the above technical solution, the invention can be further improved as follows.
Optionally, the out-of-order group-sampled video frames are generated as follows:
grouping the video frames in temporal order, wherein the number of video frames in each group is the batch size set for training;
after the grouping of the video frames is completed, the order of the groups is shuffled, while the order of the video frames within each group is maintained.
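A minimal Python sketch of this sampling strategy is given below; it assumes the frames are provided as an ordered list, and the function name and seed handling are illustrative rather than part of the disclosure.

```python
import random

def shuffle_group_sampling(frames, batch_size, seed=0):
    """Group frames in temporal order, then shuffle the group order.

    Frame order *within* each group is preserved, so every batch still
    contains temporally adjacent frames of the same identities, while
    shuffling the groups spreads the training data and balances positive
    and negative samples.
    """
    groups = [frames[i:i + batch_size]
              for i in range(0, len(frames), batch_size)]
    random.Random(seed).shuffle(groups)  # shuffle group order only
    return [frame for group in groups for frame in group]
```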
Optionally, the process of outputting the fusion feature includes:
based on an FPN structure for multi-layer feature fusion, replacing the convolution layers in the FPN structure with residual modules, replacing the addition operations in the FPN structure with semantic flow alignment modules, and correcting the spatial misalignment between the shallow target feature maps by learning semantic flows between different scales;
the semantic flow alignment module takes two feature maps of different scales as input, converts the small-scale feature map into a feature map of the same size as the large-scale feature map through convolution and up-sampling, concatenates the two feature maps along the channel dimension, and learns the semantic flow with a convolution layer; the offsets of the original two-scale feature maps are corrected according to the learned semantic flow to unify their scales, and finally the context information of the feature maps is aggregated by an addition operation.
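The following PyTorch sketch shows one plausible form of such a semantic flow alignment module; the module name, the 3×3 flow convolution, and the grid_sample-based warp convention are illustrative assumptions, not the disclosed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowAlignmentModule(nn.Module):
    """Align an up-sampled small-scale feature map to a large-scale one
    via a learned 2-channel semantic flow, then aggregate by addition."""

    def __init__(self, channels):
        super().__init__()
        self.flow_conv = nn.Conv2d(channels * 2, 2, kernel_size=3, padding=1)

    def forward(self, feat_small, feat_large):
        up = F.interpolate(feat_small, size=feat_large.shape[-2:],
                           mode='bilinear', align_corners=False)
        # Concatenate along channels and learn the semantic flow (dx, dy).
        flow = self.flow_conv(torch.cat([up, feat_large], dim=1))
        aligned = self.warp(up, flow)   # correct the spatial misalignment
        return aligned + feat_large     # aggregate context by addition

    @staticmethod
    def warp(feat, flow):
        n, _, h, w = feat.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feat.device),
            torch.linspace(-1, 1, w, device=feat.device), indexing='ij')
        grid = torch.stack((xs, ys), dim=-1).expand(n, -1, -1, -1)
        # Treat the flow as pixel offsets; convert to normalized coordinates.
        scale = torch.tensor([max(w - 1, 1) / 2, max(h - 1, 1) / 2],
                             device=feat.device)
        offset = flow.permute(0, 2, 3, 1) / scale
        return F.grid_sample(feat, grid + offset, mode='bilinear',
                             padding_mode='border', align_corners=True)
```

In the full FAFPN this module replaces every addition in the FPN, so it is applied once per pair of adjacent scales.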
Optionally, the process of generating the target fine-grained embeddings includes:
dividing the input mask feature map into a plurality of blocks along the channel dimension, wherein each block of the mask feature map uses one branch to generate a mask;
for each branch of the target part mask generator, attending to different local features of the target based on a Non-Local attention module and generating the corresponding target local mask by convolution; and generating the target fine-grained embeddings by weighted pooling over the characterization feature map with the target global mask and the target local masks.
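A sketch of the weighted pooling step follows, assuming per-target characterization maps and part masks of matching spatial resolution; the tensor layouts and names are illustrative.

```python
import torch

def fine_grained_embeddings(repr_map, part_masks):
    """Mask-weighted pooling of a characterization feature map.

    repr_map:   (N, C, H, W) characterization features per target
    part_masks: (N, K, H, W) part masks in [0, 1] from the mask generator
    Returns one global embedding and K local embeddings per target.
    """
    # Global mask: per-pixel maximum over the K part masks.
    global_mask = part_masks.max(dim=1, keepdim=True).values

    def weighted_pool(feat, mask):
        # Mask-weighted sum of features, normalized by the mask weight.
        num = (feat.unsqueeze(1) * mask.unsqueeze(2)).sum(dim=(-2, -1))
        den = mask.sum(dim=(-2, -1)).clamp(min=1e-6).unsqueeze(-1)
        return num / den  # (N, K, C)

    local_emb = weighted_pool(repr_map, part_masks)               # (N, K, C)
    global_emb = weighted_pool(repr_map, global_mask).squeeze(1)  # (N, C)
    return global_emb, local_emb
```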
Optionally, the losses calculated from the target fine-grained embeddings include: a classification loss and a Triplet loss;
the part features of the targets are expressed as:

$$F^{P} = \{\, f^{P}_{n,k} \mid n = 1,\dots,N;\ k = 1,\dots,K \,\}$$

the global features of the targets are expressed as:

$$F^{G} = \{\, f^{G}_{n} \mid n = 1,\dots,N \,\}$$

wherein N represents the number of targets in the image, each target consisting of K parts; $f^{P}_{n,k}$ represents the k-th part feature of the n-th target, and $f^{G}_{n}$ represents the global feature of the n-th target;

the soft-margin Triplet loss of the part features is calculated as:

$$L^{P}_{tri} = \frac{1}{K}\sum_{k=1}^{K} L_{tri}\big(f^{P}_{1,k},\dots,f^{P}_{N,k}\big)$$

wherein $L_{tri}(\cdot)$ denotes the soft-margin triplet loss, and $f^{P}_{1,k},\dots,f^{P}_{N,k}$ are the k-th part features of the N targets in the image;

the Triplet loss of the global features is calculated as:

$$L^{G}_{tri} = L_{tri}\big(f^{G}_{1},\dots,f^{G}_{N}\big)$$

the classification loss of the part features is calculated as:

$$L^{P}_{cls} = \frac{1}{NK}\sum_{n=1}^{N}\sum_{k=1}^{K} L^{cls}_{n,k}$$

the classification loss of the global features is calculated as:

$$L^{G}_{cls} = \frac{1}{N}\sum_{n=1}^{N} L^{cls}_{n}$$

wherein M is the number of all target identities in the dataset, $L^{cls}_{n,k}$ is the classification loss of the k-th part feature of the n-th target, and $L^{cls}_{n}$ is the classification loss computed from the classification vector of the global feature of the n-th target.
Optionally, the calculation of the classification loss $L^{cls}_{n,k}$ of the k-th part feature of the n-th target comprises:

obtaining the classification result vector of each part feature by a combination of K linear layers and Softmax:

$$p_{n,k} = \mathrm{Softmax}\big(W_{k}\, f^{P}_{n,k}\big)$$

wherein $p_{n,k}$ denotes the classification vector of the k-th part feature of the n-th target, and its channel dimension is M;

the target identity label is expressed as:

$$y_{n} = [\, y_{n,1},\dots,y_{n,M} \,]$$

wherein $y_{n,m}$ takes the value 0 or 1, indicating whether the n-th target has the same identity as the m-th identity in the ID labels;

the classification loss of the k-th part feature of the n-th target is calculated as:

$$L^{cls}_{n,k} = -\sum_{m=1}^{M} y_{n,m}\,\log p_{n,k,m}$$
Optionally, the losses calculated from the target fine-grained embeddings further include: a diversity loss; the diversity loss is calculated as:

$$L_{div} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{K(K-1)}\sum_{i=1}^{K}\sum_{\substack{j=1 \\ j\neq i}}^{K}\frac{\langle f^{P}_{n,i},\, f^{P}_{n,j}\rangle}{\lVert f^{P}_{n,i}\rVert_{2}\,\lVert f^{P}_{n,j}\rVert_{2}}$$

the final training loss is obtained as:

$$L_{total} = \lambda_{1}\big(L^{P}_{cls}+L^{G}_{cls}\big) + \lambda_{2}\big(L^{P}_{tri}+L^{G}_{tri}\big) + \lambda_{3}\,L_{div}$$

wherein $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ are set coefficients for adjusting the proportions of the different losses.
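As a hedged illustration of these loss terms, the following PyTorch sketch implements a soft-margin triplet loss, the one-hot identity classification loss, and the weighted total; the triplets are assumed to be already mined, and the default coefficients are placeholders rather than the disclosure's settings.

```python
import torch
import torch.nn.functional as F

def soft_margin_triplet(anchor, positive, negative):
    """Soft-margin triplet loss log(1 + exp(d_ap - d_an)) over pre-mined
    (anchor, positive, negative) embedding triplets."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.softplus(d_ap - d_an).mean()

def classification_loss(class_probs, one_hot_labels):
    """Cross-entropy against one-hot identity labels y_{n,m};
    class_probs (N, M) are the Softmax outputs p_{n,k}."""
    log_p = class_probs.clamp(min=1e-12).log()
    return -(one_hot_labels * log_p).sum(dim=1).mean()

def total_loss(l_cls, l_tri, l_div, lambdas=(1.0, 1.0, 1.0)):
    # Placeholder coefficients; the disclosure leaves them as settings.
    l1, l2, l3 = lambdas
    return l1 * l_cls + l2 * l_tri + l3 * l_div
```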
Optionally, the method of concatenating the global embedding and local embeddings as the target identity embedding in the association algorithm comprises the following steps:
based on the ByteTrack tracking algorithm, updating the features of matched tracks with an exponential moving average mechanism; the feature $e^{t}_{i}$ of the i-th matched track in the t-th frame is updated as:

$$e^{t}_{i} = \alpha\, e^{t-1}_{i} + (1-\alpha)\, f^{t}_{i}$$

wherein $f^{t}_{i}$ is the feature of the currently matched detection and $\alpha$ is a momentum term;
in the tracking algorithm, the feature distance matrix $D^{emb}$ is:

$$D^{emb} = 1 - \cos\big(E^{t-1},\, F^{t}\big)$$

wherein $\cos(E^{t-1}, F^{t})$ is the cosine similarity matrix between the output track features $E^{t-1}$ and the target features $F^{t}$;
the IoU distance matrix $D^{iou}$ is calculated as:

$$D^{iou} = 1 - \mathrm{IoU}\big(B^{pred},\, B^{det}\big)$$

wherein $\mathrm{IoU}(B^{pred}, B^{det})$ is the IoU matrix between the output detection boxes $B^{det}$ and the prediction boxes $B^{pred}$;
the optimized feature distance matrix is calculated as:

$$\hat{D}^{emb}_{i,j} = \begin{cases} D^{emb}_{i,j}, & D^{iou}_{i,j} < 1 \\ 1, & \text{otherwise} \end{cases}$$

and the final association distance matrix is obtained as:

$$D = \sqrt{\hat{D}^{emb} \odot D^{iou}}$$

wherein $\odot$ denotes the element-wise product;
and setting an association threshold value, and judging whether the two features of the adjacent frames are associated according to whether the calculated final distance matrix exceeds the association threshold value.
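The EMA update and the distance fusion can be sketched as follows; the momentum value, the re-normalization step, and the geometric-mean fusion reflect one reading of the formulas above and are assumptions rather than fixed by the disclosure.

```python
import numpy as np

def ema_update(track_feat, det_feat, alpha=0.9):
    """EMA update of a matched track feature; alpha is the momentum term
    (0.9 is an illustrative value)."""
    feat = alpha * track_feat + (1.0 - alpha) * det_feat
    return feat / np.linalg.norm(feat)  # keep the embedding L2-normalized

def fused_distance(track_embs, det_embs, iou_matrix):
    """Fuse cosine feature distance with IoU distance.

    track_embs (T, C) and det_embs (D, C) are assumed L2-normalized;
    iou_matrix (T, D) holds IoU between predicted and detected boxes.
    Pairs with no box overlap (IoU distance of 1) have their feature
    distance suppressed before the element-wise geometric mean.
    """
    d_emb = 1.0 - track_embs @ det_embs.T
    d_iou = 1.0 - iou_matrix
    d_emb = np.where(d_iou < 1.0, d_emb, 1.0)  # ignore distant pairs
    return np.sqrt(d_emb * d_iou)
```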
The invention provides an online multi-target tracking method based on fine-grained appearance representation. Starting from multi-scale fine-grained feature maps, a semantic-flow-aligned multi-layer feature fusion module is constructed, and a multi-branch parallel target part mask generator is used to obtain fine-grained target part masks, so that fine-grained target representations are generated for the target identity association used in tracking. The video frames are subjected to object detection with the high-performance detector YOLOX to obtain target bounding boxes. The first three layers of fine-grained feature maps output by the backbone are selected for the ROIAlign operation to obtain the corresponding target feature maps. The semantic-flow-aligned multi-layer feature fusion module learns semantic flows among the target feature maps of different scales, guides the feature maps of different scales to perform semantic alignment according to the semantic flows, and aggregates the context information of the feature maps by an addition operation. A multi-branch parallel target part mask generator attends to different parts of the target, generating the corresponding part masks and combining them into a global mask. Global and local fine-grained representations are thus obtained. In the training stage, video frames are sampled by out-of-order grouping and the relevant losses are calculated for training. In the inference stage, the association algorithm is used for target tracking. The target representation capability in multi-target tracking is improved, and the tracking performance is effectively enhanced.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings; the examples are provided to illustrate the invention and are not to be construed as limiting its scope.
Fig. 1 is a flowchart of an online multi-target tracking method based on fine-grained appearance representation, and fig. 2 is a schematic structural diagram of an embodiment of an online multi-target tracking system based on fine-grained appearance representation provided by the invention. As can be seen from fig. 1 and fig. 2, the online multi-target tracking method includes: training to obtain an online multi-target tracking model; and inputting the video frames to be tested into the trained online multi-target tracking model for inference, so as to realize online multi-target tracking.
The input of the online multi-target tracking model in the training stage is video frames sampled by out-of-order grouping, and the input in the inference stage is a single video frame to be tested.
After the video frame is input into the online multi-target tracking model, the processing procedure of the online multi-target tracking model comprises the following steps:
detecting targets in the video frame by using a YOLOX detector, and outputting feature maps with different resolutions; applying the ROIAlign operation to the feature maps to obtain multi-scale shallow target feature maps, learning semantic flows among the shallow target feature maps of different scales, aligning the feature maps according to the semantic flows, and outputting fusion features after aggregating the feature context information; the fusion features comprise a mask feature map and a characterization feature map; taking the mask feature map as input, attending to different parts of the target with a multi-branch parallel target part mask generator to generate target local masks for the corresponding parts, and combining the target local masks into a target global mask; generating target fine-grained embeddings by weighted pooling over the characterization feature map with the target global mask and the target local masks; the target fine-grained embeddings include: a global embedding and a plurality of local embeddings.
In the training stage, the online multi-target tracking model calculates losses from the global embedding and local embeddings of the targets; in the inference stage, the model concatenates the global embedding and local embeddings of each target as the target identity embedding in the association algorithm.
The invention provides an online multi-target tracking method based on fine-grained appearance representation, which attends to the details of targets: it constructs a semantically aligned multi-scale feature fusion network (Flow Alignment FPN, FAFPN) to align and aggregate high-resolution shallow feature maps, and inputs them into a multi-branch parallel target part mask generator (Multi-head Part Mask Generator, MPMG) to obtain fine target part masks. In addition, identity can be represented more fully by combining the part masks of a target into a global mask and extracting fine-grained global and local appearance embeddings. To train the Re-ID embeddings more reasonably, the invention proposes a training strategy named shuffle-group sampling (Shuffle-Group Sampling, SGS). In this strategy, video frames are grouped into short segments in temporal order, and the segments are then shuffled, balancing positive and negative samples while distributing the training data.
Embodiment 1
Embodiment 1 provided by the present invention is an embodiment of an online multi-target tracking method based on fine-grained appearance representation. As can be seen from fig. 1 and fig. 2, the embodiment of the online multi-target tracking method includes: training to obtain an online multi-target tracking model; and inputting the video frames to be tested into the trained online multi-target tracking model for inference, so as to realize online multi-target tracking. The input of the online multi-target tracking model in the training stage is video frames sampled by out-of-order grouping, and the input in the inference stage is a single video frame to be tested.
The training phase comprises the following steps:
step 1, using a trained YOLOX detector model, and freezing model parameters of the trunk and detection sections, adding to the fine-grained appearance representation network of the present invention.
Step 2, sampling video frames with out-of-order grouping: the video frames are grouped into short segments in temporal order, and the segments are then shuffled to serve as the model input during training, where the group size is the batch size set for training. YOLOX is trained using the same datasets as ByteTrack, including CrowdHuman, MOT17, CityPersons and ETHZ; the trained detector parameters are then frozen and the Re-ID branch is trained on the datasets MOT17, MOT20 and DanceTrack, respectively. The batch size is set to 8 for Re-ID training, using two 1080Ti GPUs. The part feature dimension and the global feature dimension are 128 and 256, respectively. The number of target masks is set to 6. Model parameters are updated with the Adam optimizer; the training period is 20 epochs, and the learning rate is reduced at epoch 10.
Step 3, using the target bounding-box labels as input, extracting the feature maps of the target bounding-box regions from the first three layers of fine-grained feature maps output by the backbone with the ROIAlign operation, to obtain target feature maps of the corresponding scales.
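For the ROIAlign step, a sketch using the torchvision operator is given below; the feature strides and crop size are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_align

def extract_target_maps(feature_maps, rois, strides=(4, 8, 16),
                        output_size=(32, 16)):
    """Crop per-target feature maps from the first three backbone levels.

    feature_maps: list of three (N, C, H, W) tensors from the backbone
    rois:         (R, 5) boxes as (batch_idx, x1, y1, x2, y2) in image pixels
    strides:      assumed down-sampling factor of each level
    """
    return [roi_align(fm, rois, output_size, spatial_scale=1.0 / s,
                      aligned=True)
            for fm, s in zip(feature_maps, strides)]
```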
Step 4, taking the multi-scale target feature maps as input, using semantic-flow-aligned multi-layer feature fusion to learn and align semantic flows among the feature maps of different scales, and outputting fusion features, including a mask feature map and a characterization feature map, after the feature context information is aggregated.
In one possible embodiment, step 4 includes: based on the FPN (Feature Pyramid Network) structure commonly used in multi-layer feature fusion, residual modules are used to replace the convolution layers in the FPN structure to better extract feature information, and semantic flow alignment modules are used to replace the addition operations in the FPN structure, so that the spatial misalignment between the shallow target feature maps is corrected by learning semantic flows between different scales.
The semantic flow alignment module takes two feature maps of different scales as input, converts the small-scale feature map into a feature map of the same size as the large-scale feature map through convolution and up-sampling, concatenates the two feature maps along the channel dimension, and learns the semantic flow with a convolution layer; the offsets of the original two-scale feature maps are corrected according to the learned semantic flow to unify their scales, and finally the context information of the feature maps is aggregated by an addition operation.
The overall structure of the FAFPN is shown in fig. 3, and the residual block (ResBlock, RB) and the semantic flow alignment module (Flow Alignment Module, FAM) contained therein are shown in fig. 4 (a) and fig. 4 (b). Residual blocks (RB) replace the 1×1 convolutions in the FPN, which fuses shallow features more fully. The addition operations are replaced with semantic flow alignment modules (FAM) to enable semantic alignment. The channel dimensions of the multi-scale feature maps are unified by the RBs, and the maps are then input to the FAMs for alignment and aggregation. Finally, two RBs are used to generate the mask feature map and the characterization feature map respectively, to meet the different requirements of masking and representation.
Fig. 4 (a) and fig. 4 (b) show the specific structures of RB and FAM. RB is a simple residual block: it transforms the channel dimension of the input feature map with a 1×1 convolution and BN layers, and then applies the residual structure to obtain its output. FAM takes two feature maps of different scales as input. After up-sampling, the small-scale feature map has the same size as the large-scale one. The two maps are then concatenated to generate a semantic flow, whose channels represent the pixel offsets between the two feature maps. The Warp operation aligns the feature maps according to the semantic flow. Finally, the FAM uses an addition operation to aggregate the context information of the aligned feature maps. After semantic alignment, the multi-scale shallow feature maps are aggregated into a high-resolution feature map with accurate semantic information, which is the basis for generating the various fine-grained representations.
Step 5, dividing the input mask feature map into blocks along the channel dimension, each block using one branch to generate a target part mask. The obtained target part masks are combined into a global mask, and target fine-grained embeddings, including a global embedding and a plurality of part embeddings, are generated by weighted pooling over the characterization feature map with the target global and local masks.
The process of generating the target fine-grained embeddings includes:
dividing the input mask feature map into a plurality of blocks along the channel dimension, wherein each block of the mask feature map uses one branch to generate a mask; each block of the target feature map may be regarded as a different mapping of the target.
For each branch of the target part mask generator, attending to different local features of the target based on a Non-Local attention module and generating the corresponding target part mask by convolution; and generating the target fine-grained embeddings by weighted pooling over the characterization feature map with the target global mask and the target local masks.
As shown in fig. 5, for any branch of the MPMG, i.e., a PMG, three 1×1 convolutions are used to obtain the query, key and value respectively. The query and key generate an attention matrix, which is multiplied by the value and then added to the input. After this, a target part mask with values between 0 and 1 is generated using a 1×1 convolution followed by a Sigmoid layer. The same operation is performed on the remaining feature blocks, resulting in K different target part masks. The target part masks are concatenated along the channel dimension, and the maximum value along the channel dimension is taken to form a global mask containing different local information. Notably, parameters are not shared between the parallel branches, which ensures sufficient attention diversity.
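A sketch of a single PMG branch and of the mask combination follows; the attention scaling and exact layer arrangement are illustrative assumptions around the disclosed Non-Local structure.

```python
import torch
import torch.nn as nn

class PartMaskGenerator(nn.Module):
    """One MPMG branch: Non-Local self-attention over its feature block,
    then a 1x1 convolution + Sigmoid emitting a part mask in [0, 1]."""

    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, 1)
        self.key = nn.Conv2d(channels, channels, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.mask_head = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)    # (N, HW, C)
        k = self.key(x).flatten(2)                      # (N, C, HW)
        v = self.value(x).flatten(2).transpose(1, 2)    # (N, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)  # (N, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(n, c, h, w)
        return self.mask_head(out + x)                  # residual, then mask

def combine_masks(part_masks):
    """Concatenate K part masks along the channel dimension and take the
    per-pixel maximum to form the global mask."""
    return torch.cat(part_masks, dim=1).max(dim=1, keepdim=True).values
```

Each of the K branches is a separate PartMaskGenerator instance, consistent with the statement that parameters are not shared between branches.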
Step 6, calculating the classification loss, Triplet loss and diversity loss for the global embedding and the plurality of local embeddings respectively.
The part features of the targets are expressed as:

$$F^{P} = \{\, f^{P}_{n,k} \mid n = 1,\dots,N;\ k = 1,\dots,K \,\}$$

The global features of the targets are expressed as:

$$F^{G} = \{\, f^{G}_{n} \mid n = 1,\dots,N \,\}$$

wherein N represents the number of targets in the image, each target consisting of K parts; $f^{P}_{n,k}$ represents the k-th part feature of the n-th target, and $f^{G}_{n}$ represents the global feature of the n-th target.

The soft-margin Triplet loss of the part features is calculated as:

$$L^{P}_{tri} = \frac{1}{K}\sum_{k=1}^{K} L_{tri}\big(f^{P}_{1,k},\dots,f^{P}_{N,k}\big)$$

wherein $L_{tri}(\cdot)$ denotes the soft-margin triplet loss, and $f^{P}_{1,k},\dots,f^{P}_{N,k}$ are the k-th part features of the N targets in the image.

Similarly, the Triplet loss of the global features is calculated as:

$$L^{G}_{tri} = L_{tri}\big(f^{G}_{1},\dots,f^{G}_{N}\big)$$

The classification loss of the part features is calculated as:

$$L^{P}_{cls} = \frac{1}{NK}\sum_{n=1}^{N}\sum_{k=1}^{K} L^{cls}_{n,k}$$

The classification loss of the global features is calculated as:

$$L^{G}_{cls} = \frac{1}{N}\sum_{n=1}^{N} L^{cls}_{n}$$

wherein M is the number of all target identities in the dataset, $L^{cls}_{n,k}$ is the classification loss of the k-th part feature of the n-th target, and $L^{cls}_{n}$ is the classification loss computed from the classification vector of the global feature of the n-th target.
The calculation of the classification loss $L^{cls}_{n,k}$ of the k-th part feature of the n-th target is as follows. After the K part features are obtained, the classification result vector of each part feature is obtained by a combination of K linear layers and Softmax:

$$p_{n,k} = \mathrm{Softmax}\big(W_{k}\, f^{P}_{n,k}\big)$$

wherein $p_{n,k}$ denotes the classification vector of the k-th part feature of the n-th target, whose channel dimension M is also the number of all target identities in the dataset.

The target identity label is expressed as:

$$y_{n} = [\, y_{n,1},\dots,y_{n,M} \,]$$

wherein $y_{n,m}$ takes the value 0 or 1, indicating whether the n-th target has the same identity as the m-th identity in the ID labels.

The classification loss of the k-th part feature of the n-th target is then calculated as:

$$L^{cls}_{n,k} = -\sum_{m=1}^{M} y_{n,m}\,\log p_{n,k,m}$$
Due to the nature of the multi-branch structure, using the classification loss and the Triplet loss alone does not guarantee that the model attends to different parts of the target. To avoid multiple branches focusing on similar details, a diversity loss is used to reduce the similarity between different part features of the same target; the losses calculated from the target fine-grained embeddings therefore further include a diversity loss, calculated as:

$$L_{div} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{K(K-1)}\sum_{i=1}^{K}\sum_{\substack{j=1 \\ j\neq i}}^{K}\frac{\langle f^{P}_{n,i},\, f^{P}_{n,j}\rangle}{\lVert f^{P}_{n,i}\rVert_{2}\,\lVert f^{P}_{n,j}\rVert_{2}}$$

The purpose of the diversity loss is intuitive: the cosine similarity between different part features of the same target should be as low as possible. Combining the above losses gives the final training loss:

$$L_{total} = \lambda_{1}\big(L^{P}_{cls}+L^{G}_{cls}\big) + \lambda_{2}\big(L^{P}_{tri}+L^{G}_{tri}\big) + \lambda_{3}\,L_{div}$$

wherein $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ are set coefficients for adjusting the proportions of the different losses.
The inference stage comprises the following steps:
and step 1', taking a single video frame as input of an online multi-target tracking model in a training stage according to the video frame sequence.
Step 2', detecting all targets in the video frame and performing post-processing to obtain target detection boxes and detection scores.
Step 3', taking the detection boxes as input, extracting the feature maps of the target bounding-box regions from the first three layers of fine-grained feature maps output by the backbone with the ROIAlign operation, to obtain target feature maps of the corresponding scales.
Steps 4' and 5' are identical to steps 4 and 5 in the training stage.
Step 6', concatenating the obtained global embedding and the plurality of part embeddings along the channel dimension to serve as the appearance representation of the target, which is the basis for judging target identity in the association algorithm.
Step 7', based on the ByteTrack tracking algorithm, updating the features of matched tracks with an exponential moving average (EMA) mechanism. The feature $e^{t}_{i}$ of the i-th matched track in the t-th frame is updated as:

$$e^{t}_{i} = \alpha\, e^{t-1}_{i} + (1-\alpha)\, f^{t}_{i}$$

wherein $f^{t}_{i}$ is the feature of the currently matched detection and $\alpha$ is a momentum term.
In the tracking algorithm, the feature distance matrix $D^{emb}$ is:

$$D^{emb} = 1 - \cos\big(E^{t-1},\, F^{t}\big)$$

wherein $\cos(E^{t-1}, F^{t})$ is the cosine similarity matrix between the output track features $E^{t-1}$ and the target features $F^{t}$.
Meanwhile, the IoU distance matrix $D^{iou}$ can be calculated as:

$$D^{iou} = 1 - \mathrm{IoU}\big(B^{pred},\, B^{det}\big)$$

wherein $\mathrm{IoU}(B^{pred}, B^{det})$ is the IoU matrix between the output detection boxes $B^{det}$ and the prediction boxes $B^{pred}$.
To exclude interference from distant targets, only feature distances between pairs of targets whose IoU distance is less than 1 are considered, which means their bounding boxes overlap. The optimized feature distance matrix is calculated as:

$$\hat{D}^{emb}_{i,j} = \begin{cases} D^{emb}_{i,j}, & D^{iou}_{i,j} < 1 \\ 1, & \text{otherwise} \end{cases}$$

Taking the element-wise product of the optimized feature distance matrix and the IoU distance matrix and then its square root gives the final association distance matrix:

$$D = \sqrt{\hat{D}^{emb} \odot D^{iou}}$$
An association threshold is set, and whether two features of adjacent frames are associated is judged according to whether the calculated final distance matrix exceeds the association threshold; for example, the association threshold can be set to 0.5.
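As an illustration, thresholded matching on the final distance matrix might look as follows; the disclosure does not specify the assignment solver, so the Hungarian algorithm from SciPy is used here as a stand-in.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(cost, threshold=0.5):
    """Match tracks (rows) to detections (columns) on the final distance
    matrix, rejecting pairs whose cost exceeds the association threshold."""
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= threshold]
    unmatched_tracks = sorted(set(range(cost.shape[0])) - {r for r, _ in matches})
    unmatched_dets = sorted(set(range(cost.shape[1])) - {c for _, c in matches})
    return matches, unmatched_tracks, unmatched_dets
```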
The above processes are integrated into a unified multi-target tracking framework, and experiments are performed on the MOT17 and MOT20 test sets to verify the effectiveness of the invention. Tracking performance is assessed using standard MOT metrics such as HOTA, MOTA, IDF1, IDs, FP and FN. HOTA comprehensively evaluates the effects of detection, association and localization; MOTA represents the overall tracking accuracy; IDF1 represents the identity F1 score of the tracked trajectories; MT represents the proportion of trajectories whose tracked length exceeds 80% of their ground-truth length; ML represents the proportion of trajectories whose tracked length is less than 20% of their ground-truth length; FP represents the number of background regions incorrectly judged as tracked targets; FN represents the number of tracked targets incorrectly judged as background; and IDs represents the number of identity switches within the trajectories.
The tracking results on the MOT17 test set are shown in table 1, and the specific results for each video of MOT17 are shown in table 2. The tracking results on the MOT20 test set are shown in table 3, and the specific results for each video of MOT20 are shown in table 4.
Table 1 Final results on the MOT17 test set
Table 2 Per-video results on the MOT17 test set
Table 3 Final results on the MOT20 test set
Table 4 Per-video results on the MOT20 test set
The embodiment of the invention provides an online multi-target tracking method based on fine-grained appearance representation. Starting from multi-scale fine-grained feature maps, a semantic-flow-aligned multi-layer feature fusion module is constructed, and a multi-branch parallel target part mask generator is used to obtain fine-grained target part masks, so that fine-grained target representations are generated for the target identity association used in tracking. The video frames are subjected to object detection with the high-performance detector YOLOX to obtain target bounding boxes. The first three layers of fine-grained feature maps output by the backbone are selected for the ROIAlign operation to obtain the corresponding target feature maps. The semantic-flow-aligned multi-layer feature fusion module learns semantic flows among the target feature maps of different scales, guides the feature maps of different scales to perform semantic alignment according to the semantic flows, and aggregates the context information of the feature maps by an addition operation. A multi-branch parallel target part mask generator attends to different parts of the target, generating the corresponding part masks and combining them into a global mask. Global and local fine-grained representations are thus obtained. In the training stage, video frames are sampled by out-of-order grouping and the relevant losses are calculated for training. In the inference stage, the association algorithm is used for target tracking. The target representation capability in multi-target tracking is improved, and the tracking performance is effectively enhanced.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.