CN115880615B - Online multi-target tracking method based on fine-grained appearance representation - Google Patents

Online multi-target tracking method based on fine-grained appearance representation

Info

Publication number
CN115880615B
CN115880615B (application CN202310127440.2A)
Authority
CN
China
Prior art keywords
target
feature
features
mask
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310127440.2A
Other languages
Chinese (zh)
Other versions
CN115880615A (en)
Inventor
韩守东
任昊
黄程
王宏伟
王法权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Tuke Intelligent Information Technology Co ltd
Original Assignee
Wuhan Tuke Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Tuke Intelligent Technology Co ltd filed Critical Wuhan Tuke Intelligent Technology Co ltd
Priority to CN202310127440.2A priority Critical patent/CN115880615B/en
Publication of CN115880615A publication Critical patent/CN115880615A/en
Application granted granted Critical
Publication of CN115880615B publication Critical patent/CN115880615B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to an online multi-target tracking method based on fine-grained appearance representation. The high-performance detector YOLOX performs target detection on video frames to obtain target bounding boxes. The first three fine-grained feature maps output by the backbone are selected for the ROIAlign operation to obtain the corresponding target feature maps. A semantic-flow-aligned multi-layer feature fusion module learns the semantic flows between target feature maps of different scales, guides the feature maps of different scales to perform semantic alignment according to these flows, and aggregates the context information of the feature maps with an addition operation. A multi-branch parallel target part mask generator focuses on different parts of the target, generates the corresponding part masks, combines them into a global mask, and thereby obtains global and local fine-grained representations. In the training phase, video frames sampled by shuffled group sampling are used to compute the relevant losses for training. In the inference phase, an association algorithm is used for target tracking. The target representation capability and tracking performance in multi-target tracking are improved.

Description

Online multi-target tracking method based on fine-grained appearance representation
Technical Field
The invention relates to the field of multi-target tracking within video scene understanding and analysis, and in particular to an online multi-target tracking method based on fine-grained appearance representation.
Background
With the continuous development of computer vision technology and related hardware, online high-precision Multi-Object Tracking (MOT) has become possible. As a basic task in computer vision, multi-target tracking plays an important role in scenes such as autonomous driving and video surveillance. For example, in autonomous driving it is necessary to detect and track all targets in the driving scene using multi-target tracking techniques to ensure a correct and feasible driving path, which is critical for autonomous driving.
MOT aims to locate targets and maintain their unique identities. Recent MOT methods mainly follow the tracking-by-detection paradigm and separate tracking into two independent steps: detection and association. A detector first detects the objects in each frame, and the appearance representation and position information are then used as the basis of association to connect each object with its corresponding track. As inherent attributes of the target, appearance and position are complementary in the association.
However, intra-class and inter-class occlusion is unavoidable due to object or camera motion, which places more stringent demands on the visual representation. Recent appearance representation methods extract the features of a bounding box region or of a center point as the appearance embedding. Bounding-box-based methods apply global average pooling to convert the features of the target bounding box region into an appearance embedding. These methods treat target features and interference features (background and other targets) equally, which is not reasonable. Researchers have noted this problem and used the feature at the target center as its appearance embedding, eliminating interference from the background as much as possible. Nevertheless, when the target is occluded, its central feature is inevitably disturbed by noise from other objects. These coarse-grained global embeddings are very sensitive to noise and become unreliable once the signal-to-noise ratio drops. The ambiguity of the global representation has become an obstacle for these approaches. As detectors improve, the appearance representation gradually fails to keep pace with the detection performance. Some researchers have found that simply using positional cues is sufficient to achieve satisfactory results, while visual cues bring no further improvement. During the training phase, some existing methods train Re-ID (person re-identification) either on frames taken in video-sequence order or on all video frames shuffled together. The former does not disperse the training data, while the latter leads to an imbalance between positive and negative samples.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides an online multi-target tracking method based on fine-grained appearance representation, which focuses on different local details of targets, and the generated fine-grained global representation and local representation complement each other and describe the appearance together. When the target is occluded, the invention can still recognize from visible parts, similar to strategies that humans use when tracking the target. And in the training stage, a sampling strategy of balancing positive and negative samples is adopted, so that the model convergence process is more stable and better performance is obtained.
According to a first aspect of the present invention, there is provided an online multi-target tracking method based on fine-grained appearance representation, comprising: training to obtain an online multi-target tracking model; and inputting video frames to be tested into the trained online multi-target tracking model for inference, so as to realize online multi-target tracking;
the input of the online multi-target tracking model in the training stage is video frames sampled by shuffled group sampling, and the input of the online multi-target tracking model in the inference stage is a single video frame to be tested;
after the video frame is input into the online multi-target tracking model, the processing procedure of the online multi-target tracking model comprises the following steps:
detecting targets in the video frames by using a YOLOX detector, and outputting feature maps with different resolutions; applying the ROIAlign operation to the feature maps to obtain multi-scale target shallow feature maps, learning the semantic flows among the target shallow feature maps of different scales and aligning them, and outputting fusion features after the feature context information is aggregated; the fusion features comprise a mask feature map and a characterization feature map; taking the mask feature map as input, focusing on different parts of the target by using a multi-branch parallel target part mask generator, generating target local masks of the corresponding parts, and combining the target local masks into a target global mask; generating target fine-grained embeddings by a weighted pooling operation using the characterization feature map, the target global mask and the target local masks; the target fine-grained embeddings comprise: a global embedding and a plurality of local embeddings;
in the training stage, the online multi-target tracking model calculates losses according to the global embedding and the local embeddings of the targets; in the inference stage, the online multi-target tracking model splices the global embedding and the local embeddings of each target together as the target identity embedding used in the association algorithm.
On the basis of the technical scheme, the invention can also make the following improvements.
Optionally, the video frames sampled by shuffled group sampling are generated as follows:
the video frames are grouped in their original order, the number of video frames in each group being the batch size set during training;
after the grouping is completed, the order between the groups is shuffled, while the order of the video frames within each group is maintained.
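As a concrete illustration, a minimal Python sketch of this shuffled group sampling is given below; the helper name shuffle_group_sampling, the frame-path strings and the use of a fixed random seed are illustrative assumptions rather than details stated in the patent.

```python
# A minimal sketch of the shuffled group sampling strategy described above.
import random

def shuffle_group_sampling(frames, batch_size, seed=None):
    """Group consecutive frames into batches, then shuffle the batch order.

    Frames inside each group keep their temporal order, so nearby frames of the
    same identity stay together (positive pairs), while the groups themselves are
    presented to the model in a random order (dispersing the training data).
    """
    groups = [frames[i:i + batch_size] for i in range(0, len(frames), batch_size)]
    rng = random.Random(seed)
    rng.shuffle(groups)          # scramble the order between groups
    return groups                # order inside each group is unchanged

# Example: 8 frames with batch size 4 -> two groups whose internal order is preserved.
batches = shuffle_group_sampling([f"frame_{i:06d}.jpg" for i in range(8)], batch_size=4, seed=0)
```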
Optionally, the process of outputting the fusion feature includes:
based on an FPN structure in multi-layer feature fusion, replacing a convolution layer in the FPN structure by using a residual module, replacing addition operation in the FPN structure by using a semantic flow alignment module, and correcting spatial dislocation between the target shallow feature images by learning semantic flows between different scales;
the semantic stream alignment module takes two feature images with different scales as input, converts a small-scale feature image into a feature image with the same size as a large-scale feature image through convolution and up-sampling, splices the two feature images along the channel dimension, and learns the semantic stream by using a convolution layer; and correcting the offset of the original two-scale feature graphs according to the learned semantic stream, unifying the scales, and finally aggregating the context information of the feature graphs through adding operation.
Optionally, the process of generating the target fine granularity embedding includes:
dividing the input mask feature map into a plurality of blocks along the channel dimension, each block of the mask feature map being fed to one branch to generate a mask;
for each branch of the target partial mask generator, generating a corresponding target partial mask by convolution based on Non-Local attention modules and focusing on different Local features of a target; and generating the target fine granularity embedding by adopting a weighted pooling operation and utilizing the characterization feature map, the target global mask and the target local mask.
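The weighted pooling step described above can be sketched in PyTorch as follows; the tensor shapes, the max-fusion of part masks into a global mask, and the normalization by mask area are assumptions made for illustration only.

```python
# A minimal PyTorch sketch of weighted pooling that turns the characterization feature
# map plus the part/global masks into fine-grained embeddings.
import torch

def masked_pool(feat, mask, eps=1e-6):
    """feat: (N, C, H, W) characterization features; mask: (N, 1, H, W) values in [0, 1]."""
    weighted = (feat * mask).sum(dim=(2, 3))          # (N, C) mask-weighted sum
    return weighted / (mask.sum(dim=(2, 3)) + eps)    # normalize by mask area

def fine_grained_embeddings(feat, part_masks):
    """part_masks: (N, K, H, W); returns one global and K part embeddings per target."""
    global_mask = part_masks.max(dim=1, keepdim=True).values        # (N, 1, H, W)
    global_emb = masked_pool(feat, global_mask)                     # (N, C)
    part_embs = [masked_pool(feat, part_masks[:, k:k + 1]) for k in range(part_masks.shape[1])]
    return global_emb, torch.stack(part_embs, dim=1)                # (N, C), (N, K, C)
```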
Optionally, the loss calculated according to the target fine granularity embedding includes: classification loss and Triplet loss;
The part features of the targets are expressed as $\{f_n^k \mid n=1,\dots,N;\ k=1,\dots,K\}$, and the global features of the targets are expressed as $\{f_n^g \mid n=1,\dots,N\}$, where N denotes the number of targets in the image and each target consists of K parts; $f_n^k$ is the k-th part feature of the n-th target, and $f_n^g$ is the global feature of the n-th target;
the Triplet loss of soft margin of the partial feature is calculated as:
Figure SMS_5
wherein ,
Figure SMS_6
representing triple loss,/->
Figure SMS_7
Kth partial features representing n objects in an image;
Calculating the Triplet loss of the global feature as:
Figure SMS_8
calculating the classification loss of the partial features as follows:
Figure SMS_9
the classification loss of the global feature is calculated as follows:
Figure SMS_10
wherein ,
Figure SMS_11
m is the number of all targets in the dataset, < >>
Figure SMS_12
Classification loss for the kth partial feature of the nth object, < ->
Figure SMS_13
A classification vector that is a global feature of the nth object.
Optionally, the calculation of the classification loss $\mathcal{L}_{cls}(p_n^k)$ of the k-th part feature of the n-th target comprises:
a combination of K linear layers and Softmax is adopted to obtain the classification result vector $p_n^k$ of each part feature, where $p_n^k$ denotes the classification vector of the k-th part feature of the n-th target and its channel dimension is M;
the target identity label is expressed as $y_n = (y_{n,1},\dots,y_{n,M})$, where $y_{n,m}$ takes the value 0 or 1 and indicates whether the n-th target has the same identity as the m-th identity in the ID labels;
the classification loss of the k-th part feature of the n-th target is calculated as:
$\mathcal{L}_{cls}\big(p_n^k\big) = -\sum_{m=1}^{M} y_{n,m}\,\log p_{n,m}^k$
Optionally, the losses calculated from the target fine-grained embeddings further include a diversity loss, calculated as:
$\mathcal{L}_{div} = \sum_{n=1}^{N}\sum_{i=1}^{K}\sum_{j=i+1}^{K}\frac{f_n^i \cdot f_n^j}{\lVert f_n^i\rVert_2\,\lVert f_n^j\rVert_2}$
and the final training loss is obtained as:
$\mathcal{L} = \lambda_1\big(\mathcal{L}_{cls}^{part} + \mathcal{L}_{cls}^{global}\big) + \lambda_2\big(\mathcal{L}_{tri}^{part} + \mathcal{L}_{tri}^{global}\big) + \lambda_3\,\mathcal{L}_{div}$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are set coefficients for adjusting the proportions of the different losses.
Optionally, splicing the global embedding and the local embeddings together as the target identity embedding in the association algorithm comprises the following steps:
based on the ByteTrack tracking algorithm, the feature of a matched track is updated with an exponential moving average mechanism; the feature $\tilde{e}_i^{\,t}$ of the i-th matched track in the t-th frame is updated as:
$\tilde{e}_i^{\,t} = \alpha\,\tilde{e}_i^{\,t-1} + (1-\alpha)\,e_i^{\,t}$
where $e_i^{\,t}$ is the feature of the currently matched detection and $\alpha$ is the momentum term;
in the tracking algorithm, the feature distance matrix $d^{feat}$ is:
$d^{feat} = 1 - S_{cos}\big(\tilde{E},\,E\big)$
where $S_{cos}(\tilde{E}, E)$ is the cosine similarity matrix between the track features $\tilde{E}$ and the target features $E$;
the IoU distance matrix $d^{IoU}$ is calculated as:
$d^{IoU} = 1 - \mathrm{IoU}\big(D,\,P\big)$
where $\mathrm{IoU}(D, P)$ is the IoU matrix between the detection boxes $D$ and the prediction boxes $P$;
the optimized feature distance matrix is calculated as:
$\hat{d}^{\,feat}_{ij} = \begin{cases} d^{feat}_{ij}, & d^{IoU}_{ij} < 1 \\ 1, & \text{otherwise} \end{cases}$
and the final association distance matrix is obtained as:
$d = \sqrt{\hat{d}^{\,feat}\odot d^{IoU}}$
An association threshold is set, and whether two features of adjacent frames are associated is judged according to whether the calculated final distance matrix exceeds the association threshold.
The invention provides an online multi-target tracking method based on fine-grained appearance representation. Starting from multi-scale fine-grained feature maps, a semantic-flow-aligned multi-layer feature fusion module is constructed, and a multi-branch parallel target part mask generator is used to obtain fine-grained masks of the target parts, so that the fine-grained target representations used for identity association in tracking are generated. The high-performance detector YOLOX performs target detection on the video frames to obtain the target bounding boxes. The first three fine-grained feature maps output by the backbone are selected for the ROIAlign operation to obtain the corresponding target feature maps. The semantic-flow-aligned multi-layer feature fusion module learns the semantic flows between the target feature maps of different scales, guides the feature maps of different scales to perform semantic alignment according to the semantic flows, and aggregates the context information of the feature maps with an addition operation. The multi-branch parallel target part mask generator focuses on different parts of the target, generates the corresponding part masks, and combines them into a global mask, finally obtaining the global and local fine-grained representations. In the training phase, video frames are sampled by shuffled group sampling and the relevant losses are calculated for training. In the inference phase, the association algorithm is used for target tracking. The target representation capability in multi-target tracking is improved, and the tracking performance is effectively improved.
Drawings
FIG. 1 is a flow chart of an embodiment of an online multi-objective tracking method based on fine grain appearance representation provided by the present invention;
FIG. 2 is a schematic diagram of an embodiment of an online multi-target tracking system based on fine grain appearance representation provided by the present invention;
FIG. 3 is a schematic diagram of a semantic aligned multi-scale feature fusion network structure according to an embodiment of the present invention;
fig. 4 (a) is a schematic diagram of an RB structure provided by an embodiment of the present invention;
fig. 4 (b) is a schematic structural diagram of a FAM according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a network structure of a multi-branch parallel target portion mask generator according to an embodiment of the present invention.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings, the examples are illustrated for the purpose of illustrating the invention and are not to be construed as limiting the scope of the invention.
Fig. 1 is a flowchart of the online multi-target tracking method based on fine-grained appearance representation, and fig. 2 is a schematic structural diagram of an embodiment of the online multi-target tracking system based on fine-grained appearance representation provided by the invention. As can be seen in conjunction with fig. 1 and fig. 2, the online multi-target tracking method includes: training to obtain an online multi-target tracking model; and inputting the video frames to be tested into the trained online multi-target tracking model for inference, so as to realize online multi-target tracking.
The input of the online multi-target tracking model in the training stage is video frames sampled by shuffled group sampling, and the input of the online multi-target tracking model in the inference stage is a single video frame to be tested.
After a video frame is input into the online multi-target tracking model, the processing procedure of the online multi-target tracking model comprises the following steps:
detecting the targets in the video frame by using a YOLOX detector, and outputting feature maps with different resolutions; applying the ROIAlign operation to the feature maps to obtain multi-scale target shallow feature maps, learning the semantic flows among the target shallow feature maps of different scales and aligning them, and outputting fusion features after the feature context information is aggregated; the fusion features comprise a mask feature map and a characterization feature map; taking the mask feature map as input, focusing on different parts of the target by using a multi-branch parallel target part mask generator and generating target local masks of the corresponding parts, and combining the target local masks into a target global mask; generating target fine-grained embeddings by a weighted pooling operation using the characterization feature map, the target global mask and the target local masks; the target fine-grained embeddings include: a global embedding and a plurality of local embeddings.
In the training stage, the online multi-target tracking model calculates losses according to the global embedding and local embeddings of the targets; in the inference stage, the online multi-target tracking model splices the global embedding and local embeddings of each target together as the target identity embedding used in the association algorithm.
The invention provides an online multi-target tracking method based on fine-grained appearance representation, which focuses on the details of targets. A semantic-flow-aligned multi-scale feature fusion network (Flow Alignment FPN, FAFPN) is constructed to align and aggregate the high-resolution shallow feature maps, which are then fed into a multi-branch parallel target part mask generator (Multi-head Part Mask Generator, MPMG) to obtain fine target part masks. In addition, the identity can be represented more completely by combining the part masks of a target into a global mask and extracting fine-grained global and local appearance embeddings. To train Re-ID more reasonably, the invention proposes a training strategy named shuffled group sampling (ShuffleGroup Sampling, SGS). In this strategy, video frames are grouped into short segments in their original order, and the segments are then shuffled, which balances positive and negative samples while dispersing the training data.
Example 1
Embodiment 1 provided by the present invention is an embodiment of the online multi-target tracking method based on fine-grained appearance representation. As can be seen in fig. 1 and fig. 2, the embodiment of the online multi-target tracking method includes: training to obtain an online multi-target tracking model; and inputting the video frames to be tested into the trained online multi-target tracking model for inference, so as to realize online multi-target tracking. The input of the online multi-target tracking model in the training stage is video frames sampled by shuffled group sampling, and the input of the online multi-target tracking model in the inference stage is a single video frame to be tested.
The training phase comprises the following steps:
step 1, using a trained YOLOX detector model, and freezing model parameters of the trunk and detection sections, adding to the fine-grained appearance representation network of the present invention.
Step 2, sampling video frames by shuffled group sampling: the video frames are grouped into short segments in their original order and the segments are then shuffled to serve as the model input during training, the group size being the batch size set for training. YOLOX is trained on the same datasets as ByteTrack, including CrowdHuman, MOT17, Cityperson and ETHZ; the trained detector parameters are then frozen and Re-ID is trained on the datasets MOT17, MOT20, DanceTrack, etc., respectively. The batch size is set to 8 for Re-ID training, and two 1080Ti GPUs are used. The part feature dimension and the global feature dimension are 128 and 256, respectively. The number of target part masks is set to 6. The model parameters are updated with the Adam optimizer; training lasts 20 epochs, and the initial learning rate is reduced at epoch 10.
Step 3, using the target bounding box labels as input, extracting the feature maps of the target bounding box regions from the first three fine-grained feature maps output by the backbone with the ROIAlign operation, so as to obtain target feature maps of the corresponding scales.
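A minimal PyTorch sketch of this multi-scale ROIAlign extraction is given below; torchvision.ops.roi_align is assumed as the implementation, and the strides (4, 8, 16) and the output size are illustrative assumptions rather than values stated in the patent.

```python
# A minimal sketch of extracting multi-scale target feature maps with ROIAlign.
import torch
from torchvision.ops import roi_align

def extract_target_features(feature_maps, boxes, strides=(4, 8, 16), out_size=(16, 8)):
    """feature_maps: list of three backbone maps, each (1, C_l, H_l, W_l);
    boxes: (N, 4) target boxes in image coordinates (x1, y1, x2, y2)."""
    rois = torch.cat([torch.zeros(boxes.shape[0], 1), boxes], dim=1)  # prepend batch index 0
    per_scale = []
    for fmap, stride in zip(feature_maps, strides):
        # spatial_scale maps the image-space boxes onto this feature map's resolution
        per_scale.append(roi_align(fmap, rois, output_size=out_size,
                                   spatial_scale=1.0 / stride, aligned=True))
    return per_scale  # three (N, C_l, 16, 8) target feature maps to be fused by the FAFPN
```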
And step 4, taking the multi-scale target feature map as input, using semantic flow aligned multi-layer feature fusion to learn semantic flows among different scale feature maps and aligning, and outputting fusion features including mask feature maps and characterization feature maps after feature context information is aggregated.
In one possible embodiment, step 4 includes: based on the FPN structure commonly used in the multi-layer feature fusion, a residual module is used for replacing a convolution layer in the FPN (Feature Pyramid Networks, feature pyramid network) structure to better extract feature information, and a semantic flow alignment module is used for replacing addition operation in the FPN structure, so that spatial dislocation between each target shallow feature map is corrected by learning semantic flows between different scales.
The semantic stream alignment module takes two feature images with different scales as input, converts a small-scale feature image into a feature image with the same size as a large-scale feature image through convolution and up-sampling, splices the two feature images along the channel dimension, and learns the semantic stream by using a convolution layer; and correcting the offset of the original two-scale feature graphs according to the learned semantic stream, unifying the scales, and finally aggregating the context information of the feature graphs through adding operation.
The overall structure of the FAFPN is shown in fig. 3, and the residual block (ResBlock, RB) and the semantic flow alignment module (Flow Alignment Module, FAM) contained in it are shown in fig. 4 (a) and fig. 4 (b). Residual blocks (RB) replace the 1×1 convolutions in the FPN, which allows shallow features to be fused more fully. The addition operations are replaced with the semantic flow alignment module (FAM) to enable semantic alignment. The channel dimensions of the multi-scale feature maps are unified by the RBs, and the maps are then input into the FAMs for alignment and aggregation. Finally, two RBs are used to generate the mask feature map and the characterization feature map respectively, to meet the different requirements of mask generation and representation.
Fig. 4 (a) and fig. 4 (b) show the specific structures of the RB and the FAM. The RB is a simple residual block: it transforms the channel dimension of the input feature map with a 1×1 convolution and a BN layer, and then applies the residual structure to obtain its output. The FAM takes two feature maps of different scales as input. After up-sampling, the smaller feature map has the same size as the larger one. The two maps are then concatenated to generate the semantic flows, which represent the pixel offsets between the two feature maps. The Warp operation aligns the feature maps according to the semantic flows. Finally, the FAM aggregates the context information of the aligned feature maps with a summation operation. After semantic alignment, the multi-scale shallow feature maps are aggregated into a high-resolution feature map with accurate semantic information, which is the basis for generating the various fine-grained representations.
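A minimal PyTorch sketch of such a flow-alignment module is given below. It is a simplified interpretation of the description above: only the up-sampled coarse map is warped, the channel counts, kernel sizes and the grid construction are assumptions, and the class name FlowAlignmentModule is illustrative.

```python
# A minimal sketch of a semantic-flow alignment module (learn a 2-channel flow, warp, then add).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowAlignmentModule(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, 1)                 # adjust the coarse map before up-sampling
        self.flow_conv = nn.Conv2d(2 * channels, 2, 3, padding=1)    # predict a 2-channel semantic flow

    def forward(self, fine, coarse):
        coarse = F.interpolate(self.down(coarse), size=fine.shape[-2:],
                               mode="bilinear", align_corners=False)
        flow = self.flow_conv(torch.cat([fine, coarse], dim=1))      # (B, 2, H, W) pixel offsets
        warped = self.warp(coarse, flow)
        return fine + warped                                          # aggregate the aligned context

    @staticmethod
    def warp(feat, flow):
        b, _, h, w = feat.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys), dim=-1).float().to(feat.device)  # (H, W, 2) base sampling grid
        grid = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)           # add the predicted offsets
        gx = 2.0 * grid[..., 0] / max(w - 1, 1) - 1.0                 # normalize to [-1, 1] for grid_sample
        gy = 2.0 * grid[..., 1] / max(h - 1, 1) - 1.0
        return F.grid_sample(feat, torch.stack((gx, gy), dim=-1),
                             mode="bilinear", align_corners=True)
```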
Step 5, the input mask feature map is divided into blocks along the channel dimension, and each block is processed by one branch to generate a target part mask. The obtained target part masks are combined into a global mask, and a weighted pooling operation is applied to the characterization feature map together with the target global and local masks to generate the target fine-grained embeddings, which include a global embedding and a plurality of part embeddings.
The process of generating the target fine-grained embeddings includes:
dividing the input mask feature map into a plurality of blocks along the channel dimension, each block being fed to one branch to generate a mask; each block of the target feature map may be regarded as a different mapping of the target.
For each branch of the target partial mask generator, based on the Non-Local attention module, focusing on different Local features of the target, and generating a corresponding target partial mask by using convolution; and generating target fine granularity embedding by using the characterization feature map, the target global mask and the target local mask by adopting a weighted pooling operation.
As shown in fig. 5, for any branch of the MPMG, i.e. a PMG, three 1×1 convolutions are used to obtain the query, key and value, respectively. The query and key generate an attention matrix, which is multiplied by the value and then added to the input. After this, a target part mask with values between 0 and 1 is generated using a 1×1 convolution combined with a Sigmoid layer. The same operation is performed on the remaining feature blocks, resulting in K different target part masks. The target part masks are spliced along the channel dimension, and the maximum value along the channel dimension is taken to form a global mask containing the different local information. Notably, the parameters are not shared between the parallel branches, which ensures sufficient diversity of attention.
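A minimal PyTorch sketch of one such PMG branch and the global-mask fusion follows; the channel sizes, the softmax scaling, and the class name PartMaskBranch are simplified assumptions based on the description above, not an exact reproduction of the patented network.

```python
# A minimal sketch of a part-mask branch: non-local attention, residual add, 1x1 conv + Sigmoid.
import torch
import torch.nn as nn

class PartMaskBranch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, 1)
        self.key = nn.Conv2d(channels, channels, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.mask_head = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):                                  # x: (N, C, H, W) one block of the mask feature map
        n, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)       # (N, HW, C)
        k = self.key(x).flatten(2)                         # (N, C, HW)
        v = self.value(x).flatten(2).transpose(1, 2)       # (N, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)     # (N, HW, HW) non-local attention
        out = (attn @ v).transpose(1, 2).reshape(n, c, h, w) + x   # attention output added to the input
        return self.mask_head(out)                         # (N, 1, H, W) part mask in [0, 1]

def global_mask(part_masks):
    """part_masks: list of K (N, 1, H, W) masks -> element-wise maximum over the parts."""
    return torch.cat(part_masks, dim=1).max(dim=1, keepdim=True).values
```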
Step 6, respectively calculating classification loss, triplet loss and diversity loss for the global embedding and the plurality of local embedding.
The part features of the targets are expressed as $\{f_n^k \mid n=1,\dots,N;\ k=1,\dots,K\}$, and the global features of the targets are expressed as $\{f_n^g \mid n=1,\dots,N\}$, where N denotes the number of targets in the image and each target consists of K parts; $f_n^k$ is the k-th part feature of the n-th target, and $f_n^g$ is the global feature of the n-th target.
The soft-margin Triplet loss of the part features is calculated as:
$\mathcal{L}_{tri}^{part} = \frac{1}{K}\sum_{k=1}^{K}\mathcal{L}_{triplet}\big(\{f_n^k\}_{n=1}^{N}\big)$
where $\mathcal{L}_{triplet}(\cdot)$ denotes the soft-margin Triplet loss and $\{f_n^k\}_{n=1}^{N}$ are the k-th part features of the N targets in the image.
Similarly, the Triplet loss of the global features is calculated as:
$\mathcal{L}_{tri}^{global} = \mathcal{L}_{triplet}\big(\{f_n^g\}_{n=1}^{N}\big)$
The classification loss of the part features is calculated as:
$\mathcal{L}_{cls}^{part} = \frac{1}{NK}\sum_{n=1}^{N}\sum_{k=1}^{K}\mathcal{L}_{cls}\big(p_n^k\big)$
and the classification loss of the global features is calculated as:
$\mathcal{L}_{cls}^{global} = \frac{1}{N}\sum_{n=1}^{N}\mathcal{L}_{cls}\big(p_n^g\big)$
where M is the number of all target identities in the dataset, $\mathcal{L}_{cls}(p_n^k)$ is the classification loss of the k-th part feature of the n-th target, and $p_n^g$ is the classification vector of the global feature of the n-th target.
The calculation of the classification loss $\mathcal{L}_{cls}(p_n^k)$ of the k-th part feature of the n-th target comprises:
after the K part features are obtained, the combination of K linear layers and Softmax is adopted to obtain the classification result vector $p_n^k$ of each part feature, where $p_n^k$ denotes the classification vector of the k-th part feature of the n-th target; its channel dimension is M, which is also the number of all target identities in the dataset.
The target identity label is expressed as $y_n = (y_{n,1},\dots,y_{n,M})$, where $y_{n,m}$ takes the value 0 or 1 and indicates whether the n-th target has the same identity as the m-th identity in the ID labels.
The classification loss of the k-th part feature of the n-th target is calculated as:
$\mathcal{L}_{cls}\big(p_n^k\big) = -\sum_{m=1}^{M} y_{n,m}\,\log p_{n,m}^k$
Due to the nature of the multi-branch structure, using the classification loss and the Triplet loss alone does not guarantee that the model focuses on different parts of the target. To avoid multiple branches attending to similar details, a diversity loss is used to reduce the similarity between different part features of the same target, so the losses calculated from the target fine-grained embeddings further include a diversity loss, calculated as:
$\mathcal{L}_{div} = \sum_{n=1}^{N}\sum_{i=1}^{K}\sum_{j=i+1}^{K}\frac{f_n^i \cdot f_n^j}{\lVert f_n^i\rVert_2\,\lVert f_n^j\rVert_2}$
The purpose of the diversity loss is intuitive: the cosine similarity between different part features of the same target should be as low as possible. Combining the above losses yields the final training loss:
$\mathcal{L} = \lambda_1\big(\mathcal{L}_{cls}^{part} + \mathcal{L}_{cls}^{global}\big) + \lambda_2\big(\mathcal{L}_{tri}^{part} + \mathcal{L}_{tri}^{global}\big) + \lambda_3\,\mathcal{L}_{div}$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are set coefficients for adjusting the proportions of the different losses and can, for example, be fixed in advance.
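A minimal PyTorch sketch of these training losses follows. The exact reductions, the use of integer class labels with cross_entropy, and the default weights lambdas=(1.0, 1.0, 0.1) are assumptions made for illustration; the patent only states that a soft-margin Triplet loss, an identity classification loss and a diversity loss are combined with set coefficients.

```python
# A minimal sketch of the soft-margin Triplet, classification and diversity losses.
import torch
import torch.nn.functional as F

def soft_margin_triplet(anchor, positive, negative):
    d_ap = (anchor - positive).pow(2).sum(-1)
    d_an = (anchor - negative).pow(2).sum(-1)
    return F.softplus(d_ap - d_an).mean()          # log(1 + exp(d_ap - d_an))

def diversity_loss(part_feats):
    """part_feats: (N, K, C); penalize cosine similarity between parts of the same target."""
    f = F.normalize(part_feats, dim=-1)
    sim = f @ f.transpose(1, 2)                    # (N, K, K) pairwise cosine similarity
    k = f.shape[1]
    off_diag = sim - torch.eye(k, device=f.device) # zero out self-similarity on the diagonal
    return off_diag.sum() / (f.shape[0] * k * (k - 1))

def total_loss(cls_logits, labels, triplet_terms, part_feats, lambdas=(1.0, 1.0, 0.1)):
    l_cls = F.cross_entropy(cls_logits, labels)    # identity classification (global or per part)
    l_tri = sum(triplet_terms) / len(triplet_terms)
    l_div = diversity_loss(part_feats)
    l1, l2, l3 = lambdas
    return l1 * l_cls + l2 * l_tri + l3 * l_div
```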
The inference stage comprises:
and step 1', taking a single video frame as input of an online multi-target tracking model in a training stage according to the video frame sequence.
And 2', detecting all targets in the video frame and performing post-processing to obtain a target detection frame and a detection score.
Step 3', taking the detection boxes as input, extracting the feature maps of the target bounding box regions from the first three fine-grained feature maps output by the backbone with the ROIAlign operation, so as to obtain target feature maps of the corresponding scales.
Step 4 'and step 5' are identical to steps 4, 5 in the training phase.
Step 6', splicing the obtained global embedding and the plurality of part embeddings along the channel dimension, and using the spliced result as the appearance representation of the target and as the basis for judging target identity in the association algorithm.
Step 7', based on the ByteTrack tracking algorithm, the features of the matched tracks are updated with an exponential moving average (EMA) mechanism. The feature $\tilde{e}_i^{\,t}$ of the i-th matched track in the t-th frame is updated as:
$\tilde{e}_i^{\,t} = \alpha\,\tilde{e}_i^{\,t-1} + (1-\alpha)\,e_i^{\,t}$
where $e_i^{\,t}$ is the feature of the currently matched detection and $\alpha$ is the momentum term.
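A minimal Python sketch of this EMA update follows; the momentum value 0.9 is an illustrative assumption, as this passage does not fix it.

```python
# A minimal sketch of the exponential moving average update of a track's appearance feature.
def ema_update(track_feat, det_feat, momentum=0.9):
    """Blend the stored track feature with the feature of the newly matched detection."""
    return [momentum * t + (1.0 - momentum) * d for t, d in zip(track_feat, det_feat)]
```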
In the tracking algorithm, the feature distance matrix $d^{feat}$ is:
$d^{feat} = 1 - S_{cos}\big(\tilde{E},\,E\big)$
where $S_{cos}(\tilde{E}, E)$ is the cosine similarity matrix between the track features $\tilde{E}$ and the target features $E$.
At the same time, an IoU distance matrix $d^{IoU}$ can be calculated:
$d^{IoU} = 1 - \mathrm{IoU}\big(D,\,P\big)$
where $\mathrm{IoU}(D, P)$ is the IoU matrix between the detection boxes $D$ and the prediction boxes $P$.
To exclude interference from distant targets, only the feature distances between pairs of targets whose IoU distance is less than 1 are considered, which means that their bounding boxes overlap. The optimized feature distance matrix is calculated as:
$\hat{d}^{\,feat}_{ij} = \begin{cases} d^{feat}_{ij}, & d^{IoU}_{ij} < 1 \\ 1, & \text{otherwise} \end{cases}$
The element-wise product of the optimized feature distance matrix and the IoU distance matrix is taken, and its square root gives the final association distance matrix:
$d = \sqrt{\hat{d}^{\,feat}\odot d^{IoU}}$
An association threshold is set, and whether two features of adjacent frames are associated is judged according to whether the calculated final distance matrix exceeds the association threshold; for example, the association threshold can be set to 0.5.
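A minimal NumPy sketch of this cost construction follows; the square-root fusion of the two distance matrices is an interpretation of the wording above and, like the helper name association_cost, should be treated as an assumption.

```python
# A minimal sketch of the association cost: cosine feature distance gated by box overlap,
# fused element-wise with the IoU distance.
import numpy as np

def association_cost(track_embs, det_embs, iou_matrix):
    """track_embs: (T, C); det_embs: (D, C); iou_matrix: (T, D) box IoU between tracks and detections."""
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    feat_dist = 1.0 - t @ d.T                               # cosine distance matrix
    iou_dist = 1.0 - iou_matrix                             # IoU distance matrix
    feat_dist = np.where(iou_dist < 1.0, feat_dist, 1.0)    # ignore pairs whose boxes do not overlap
    return np.sqrt(feat_dist * iou_dist)                    # fused final distance matrix

# Pairs whose cost is below the association threshold (0.5 in the embodiment above)
# are candidates for matching in the subsequent assignment step.
```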
The above processes are integrated into a unified multi-target tracking framework, and experiments are performed on the MOT17 and MOT20 test sets to verify the effectiveness of the invention. Tracking performance is assessed using CLEAR-MOT indicators such as HOTA, MOTA, IDF1, MT, ML, FP, FN and IDs. HOTA comprehensively evaluates the effects of detection, association and localization; MOTA reflects the overall tracking accuracy; IDF1 measures the identity consistency of the tracked trajectories; MT is the proportion of trajectories tracked over more than 80% of their length; ML is the proportion of trajectories tracked over less than 20% of their length; FP is the number of background detections wrongly judged as tracked targets; FN is the number of tracked targets wrongly judged as background (missed targets); and IDs is the number of identity switches within the trajectories.
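For reference, MOTA mentioned above follows the standard CLEAR-MOT formulation; this definition is general background on the metric and is not text reproduced from the patent:

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}
```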
The tracking results on the MOT17 test set are shown in table 1, and the specific results for each video of MOT17 are shown in table 2. The tracking results on the MOT20 test set are shown in table 3, and the specific results for each video of MOT20 are shown in table 4.
TABLE 1 MOT17 test set final results
TABLE 2 MOT17 test set per-video results
TABLE 3 MOT20 test set final results
TABLE 4 MOT20 test set per-video results
The embodiment of the invention provides an online multi-target tracking method based on fine-grained appearance representation. Starting from multi-scale fine-grained feature maps, a semantic-flow-aligned multi-layer feature fusion module is constructed, and a multi-branch parallel target part mask generator is used to obtain fine-grained masks of the target parts, so that the fine-grained target representations used for identity association in tracking are generated. The high-performance detector YOLOX performs target detection on the video frames to obtain the target bounding boxes. The first three fine-grained feature maps output by the backbone are selected for the ROIAlign operation to obtain the corresponding target feature maps. The semantic-flow-aligned multi-layer feature fusion module learns the semantic flows between the target feature maps of different scales, guides the feature maps of different scales to perform semantic alignment according to the semantic flows, and aggregates the context information of the feature maps with an addition operation. The multi-branch parallel target part mask generator focuses on different parts of the target, generates the corresponding part masks, and combines them into a global mask, finally obtaining the global and local fine-grained representations. In the training phase, video frames are sampled by shuffled group sampling and the relevant losses are calculated for training. In the inference phase, the association algorithm is used for target tracking. The target representation capability in multi-target tracking is improved, and the tracking performance is effectively improved.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (6)

1. An online multi-target tracking method based on fine-grained appearance representation, characterized in that the online multi-target tracking method comprises the following steps: training to obtain an online multi-target tracking model; and inputting video frames to be tested into the trained online multi-target tracking model for inference, so as to realize online multi-target tracking;
the input of the online multi-target tracking model in the training stage is video frames sampled by shuffled group sampling, and the input of the online multi-target tracking model in the inference stage is a single video frame to be tested;
after the video frame is input into the online multi-target tracking model, the processing procedure of the online multi-target tracking model comprises the following steps:
detecting targets in the video frames by using a YOLOX detector, and outputting feature maps with different resolutions; applying the ROIAlign operation to the feature maps to obtain multi-scale target shallow feature maps, learning the semantic flows among the target shallow feature maps of different scales and aligning them, and outputting fusion features after the feature context information is aggregated; the fusion features comprise a mask feature map and a characterization feature map; taking the mask feature map as input, focusing on different parts of the target by using a multi-branch parallel target part mask generator and generating target local masks of the corresponding parts, and combining the target local masks into a target global mask; generating target fine-grained features by a weighted pooling operation using the characterization feature map, the target global mask and the target local masks; the target fine-grained features include: a global feature and a plurality of local features;
in the training stage, the online multi-target tracking model calculates losses according to the global features and local features of the targets; in the inference stage, the online multi-target tracking model splices the global features and the local features of each target together as the target identity feature used in the association algorithm;
the process of outputting the fusion features comprises:
based on an FPN structure in multi-layer feature fusion, replacing a convolution layer in the FPN structure by using a residual module, replacing addition operation in the FPN structure by using a semantic flow alignment module, and correcting spatial dislocation between the target shallow feature images by learning semantic flows between different scales; generating a mask feature map and a characterization feature map respectively by using two residual error modules;
the semantic stream alignment module takes two feature images with different scales as input, converts a small-scale feature image into a feature image with the same size as a large-scale feature image through convolution and up-sampling, splices the two feature images along the channel dimension, and learns the semantic stream by using a convolution layer; correcting the offset of the original two-scale feature images according to the learned semantic stream, unifying the scales, and finally aggregating the context information of the feature images through adding operation;
the process of generating the target fine granularity characteristic comprises the following steps:
dividing the input mask feature map into a plurality of blocks along the channel dimension, each block of the mask feature map being fed to one branch to generate a mask;
for each branch of the target partial mask generator, generating a corresponding target partial mask by convolution based on Non-Local attention modules and focusing on different Local features of a target; and generating the target fine-grained feature by using the characterization feature map, the target global mask and the target local mask by adopting a weighted pooling operation.
2. The online multi-target tracking method of claim 1, wherein the video frames sampled by shuffled group sampling are generated as follows:
the video frames are grouped in their original order, the number of video frames in each group being the batch size set during training;
after the grouping is completed, the order between the groups is shuffled, while the order of the video frames within each group is maintained.
3. The online multi-target tracking method of claim 1, wherein the loss calculated from the target fine-grained feature comprises: classification loss and Triplet loss;
the part features of the targets are expressed as $\{f_n^k \mid n=1,\dots,N;\ k=1,\dots,K\}$, and the global features of the targets are expressed as $\{f_n^g \mid n=1,\dots,N\}$, wherein N denotes the number of targets in the image and each target consists of K parts; $f_n^k$ is the k-th part feature of the n-th target, and $f_n^g$ is the global feature of the n-th target;
the soft-margin Triplet loss of the part features is calculated as:
$\mathcal{L}_{tri}^{part} = \frac{1}{K}\sum_{k=1}^{K}\mathcal{L}_{triplet}\big(\{f_n^k\}_{n=1}^{N}\big)$
wherein $\mathcal{L}_{triplet}(\cdot)$ denotes the soft-margin Triplet loss and $\{f_n^k\}_{n=1}^{N}$ are the k-th part features of the N targets in the image;
the Triplet loss of the global features is calculated as:
$\mathcal{L}_{tri}^{global} = \mathcal{L}_{triplet}\big(\{f_n^g\}_{n=1}^{N}\big)$
the classification loss of the part features is calculated as:
$\mathcal{L}_{cls}^{part} = \frac{1}{NK}\sum_{n=1}^{N}\sum_{k=1}^{K}\mathcal{L}_{cls}\big(p_n^k\big)$
the classification loss of the global features is calculated as:
$\mathcal{L}_{cls}^{global} = \frac{1}{N}\sum_{n=1}^{N}\mathcal{L}_{cls}\big(p_n^g\big)$
wherein M is the number of all target identities in the dataset, $\mathcal{L}_{cls}(p_n^k)$ is the classification loss of the k-th part feature of the n-th target, and $p_n^g$ is the classification vector of the global feature of the n-th target.
4. The online multi-target tracking method of claim 3, wherein the calculation of the classification loss $\mathcal{L}_{cls}(p_n^k)$ of the k-th part feature of the n-th target comprises:
the combination of K linear layers and Softmax is adopted to obtain the classification result vector $p_n^k$ of each part feature, wherein $p_n^k$ denotes the classification vector of the k-th part feature of the n-th target and its channel dimension is M;
the target identity label is expressed as $y_n = (y_{n,1},\dots,y_{n,M})$, wherein $y_{n,m}$ takes the value 0 or 1 and indicates whether the n-th target has the same identity as the m-th identity in the ID labels;
the classification loss of the k-th part feature of the n-th target is calculated as:
$\mathcal{L}_{cls}\big(p_n^k\big) = -\sum_{m=1}^{M} y_{n,m}\,\log p_{n,m}^k$
5. the online multi-target tracking method of claim 3, wherein the loss calculated from the target fine-grained feature further comprises: loss of diversity; the calculation formula of the diversity loss is as follows:
$\mathcal{L}_{div} = \sum_{n=1}^{N}\sum_{i=1}^{K}\sum_{j=i+1}^{K}\frac{f_n^i \cdot f_n^j}{\lVert f_n^i\rVert_2\,\lVert f_n^j\rVert_2}$
obtaining the final training loss:
$\mathcal{L} = \lambda_1\big(\mathcal{L}_{cls}^{part} + \mathcal{L}_{cls}^{global}\big) + \lambda_2\big(\mathcal{L}_{tri}^{part} + \mathcal{L}_{tri}^{global}\big) + \lambda_3\,\mathcal{L}_{div}$
wherein $\lambda_1$, $\lambda_2$ and $\lambda_3$ are set coefficients for adjusting the proportions of the different losses.
6. The online multi-target tracking method of claim 1, wherein stitching the global features and local features together as target identity features in an association algorithm further comprises:
based on the ByteTrack tracking algorithm, the features of the matched tracks are updated with an exponential moving average mechanism; the feature $\tilde{e}_i^{\,t}$ of the i-th matched track in the t-th frame is updated as:
$\tilde{e}_i^{\,t} = \alpha\,\tilde{e}_i^{\,t-1} + (1-\alpha)\,e_i^{\,t}$
wherein $e_i^{\,t}$ is the feature of the currently matched detection and $\alpha$ is the momentum term;
in the tracking algorithm, the feature distance matrix $d^{feat}$ is:
$d^{feat} = 1 - S_{cos}\big(\tilde{E},\,E\big)$
wherein $S_{cos}(\tilde{E}, E)$ is the cosine similarity matrix between the track features $\tilde{E}$ and the target features $E$;
the IoU distance matrix $d^{IoU}$ is calculated as:
$d^{IoU} = 1 - \mathrm{IoU}\big(D,\,P\big)$
wherein $\mathrm{IoU}(D, P)$ is the IoU matrix between the detection boxes $D$ and the prediction boxes $P$;
the optimized feature distance matrix is calculated as:
$\hat{d}^{\,feat}_{ij} = \begin{cases} d^{feat}_{ij}, & d^{IoU}_{ij} < 1 \\ 1, & \text{otherwise} \end{cases}$
the final association distance matrix is obtained as:
$d = \sqrt{\hat{d}^{\,feat}\odot d^{IoU}}$
and an association threshold is set, and whether two features of adjacent frames are associated is judged according to whether the calculated final distance matrix exceeds the association threshold.
CN202310127440.2A 2023-02-17 2023-02-17 Online multi-target tracking method based on fine-grained appearance representation Active CN115880615B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310127440.2A CN115880615B (en) 2023-02-17 2023-02-17 Online multi-target tracking method based on fine-grained appearance representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310127440.2A CN115880615B (en) 2023-02-17 2023-02-17 Online multi-target tracking method based on fine-grained appearance representation

Publications (2)

Publication Number Publication Date
CN115880615A CN115880615A (en) 2023-03-31
CN115880615B true CN115880615B (en) 2023-05-09

Family

ID=85761279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310127440.2A Active CN115880615B (en) 2023-02-17 2023-02-17 Online multi-target tracking method based on fine-grained appearance representation

Country Status (1)

Country Link
CN (1) CN115880615B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116916080A (en) * 2019-05-17 2023-10-20 上海哔哩哔哩科技有限公司 Video data processing method, device, computer equipment and readable storage medium
CN111460926B (en) * 2020-03-16 2022-10-14 华中科技大学 Video pedestrian detection method fusing multi-target tracking clues
US11544503B2 (en) * 2020-04-06 2023-01-03 Adobe Inc. Domain alignment for object detection domain adaptation tasks
CN111652899B (en) * 2020-05-29 2023-11-14 中国矿业大学 Video target segmentation method for space-time component diagram
CN114186564B (en) * 2021-11-05 2023-11-24 北京百度网讯科技有限公司 Pre-training method and device for semantic representation model and electronic equipment
CN115272404B (en) * 2022-06-17 2023-07-18 江南大学 Multi-target tracking method based on kernel space and implicit space feature alignment

Also Published As

Publication number Publication date
CN115880615A (en) 2023-03-31


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: No. 548, 5th Floor, Building 10, No. 28 Linping Avenue, Donghu Street, Linping District, Hangzhou City, Zhejiang Province

Patentee after: Hangzhou Tuke Intelligent Information Technology Co.,Ltd.

Country or region after: China

Address before: 430000 B033, No. 05, 4th floor, building 2, international enterprise center, No. 1, Guanggu Avenue, Donghu New Technology Development Zone, Wuhan, Hubei (Wuhan area of free trade zone)

Patentee before: Wuhan Tuke Intelligent Technology Co.,Ltd.

Country or region before: China

CP03 Change of name, title or address