CN117173607A - Multi-level fusion multi-target tracking method, system and computer readable storage medium - Google Patents

Multi-level fusion multi-target tracking method, system and computer readable storage medium

Info

Publication number
CN117173607A
CN117173607A (application number CN202311018804.XA)
Authority
CN
China
Prior art keywords
target
node
sub
level
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311018804.XA
Other languages
Chinese (zh)
Inventor
刘宪钦
张逸君
张伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Jiamaotong Technology Co ltd
Guangdong Branch Of National Customs Information Center
Original Assignee
Guangdong Jiamaotong Technology Co ltd
Guangdong Branch Of National Customs Information Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Jiamaotong Technology Co ltd, Guangdong Branch Of National Customs Information Center filed Critical Guangdong Jiamaotong Technology Co ltd
Priority to CN202311018804.XA priority Critical patent/CN117173607A/en
Publication of CN117173607A publication Critical patent/CN117173607A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a multi-level fusion multi-target tracking method, system and computer readable storage medium, comprising the following steps: extracting target re-identification features with a joint zero-shot segmentation network, target relationship graph neural network processing, and multi-level long-range track fusion. Sub-method one comprises: performing segmentation preprocessing on human body targets with the Transformer-based segmentation model SAM; and inputting the segmentation-preprocessed target picture frames into a pre-trained re-identification network to extract re-identification features of the human body targets. Sub-method two comprises: given a continuous video segment and a corresponding set of detection boxes, constructing a target relationship graph model with a GNN. Sub-method three comprises: training the target relationship graph models of sub-method two on sequences of different levels, and linking video clips of different sizes with the target relationship graph models. The method has the effect of improving the feature recognition accuracy of multi-target tracking.

Description

Multi-level fusion multi-target tracking method, system and computer readable storage medium
Technical Field
The application relates to the technical field of machine vision, and in particular to a multi-level fusion multi-target tracking method, system and computer readable storage medium.
Background
Multi-object tracking (Multiple Object Tracking) is one of the fundamental tasks of computer vision, whose goal is to track multiple objects over time in a video sequence. The core goal of multi-target tracking is to handle various challenges such as occlusion, motion blur, and illumination changes while still maintaining the identity of each object across successive frames.
Most multi-target tracking algorithms follow the TBD (Tracking-by-Detection) paradigm: targets are first detected in each frame; the frame-by-frame detections are then associated to form trajectories.
When high-accuracy object detection is available, data association mainly occurs between detections that are close in time, so-called short-term association. In general, simple cues such as position, motion-related proximity, or local appearance are sufficient to ensure accurate association.
However, in crowded scenes we face different challenges: for example, objects may often be occluded and go undetected for several frames. This requires establishing associations between detections separated by long time spans, i.e., long-term associations.
Given the differences in the nature of these tasks, solutions commonly used for short-term association tend to struggle in long-term association scenarios.
For the challenge of long-term association, the main problem of existing tracking methods based on visual information is that the extracted appearance features of targets have weak representational capability, making it difficult to accurately match the same target and distinguish different targets. For example: methods based on convolutional neural networks (CNNs) rely on the receptive field provided by the convolution kernel, so global features are difficult to extract, causing a certain loss of information; Transformer-based methods focus more on high-level semantic information but lack the capability to model low-level information, and struggle with the translation, rotation, and distortion caused by target motion. In addition, some methods rely on an independent re-identification mechanism, feeding detection boxes into a pre-trained re-identification model to extract target appearance features; the problem is that the detection box carries background information that introduces noise and reduces the discriminability of the target appearance features to a certain extent.
In view of the above, the present application proposes a new technical solution.
Disclosure of Invention
In order to improve the feature recognition accuracy of multi-target tracking, the application provides a multi-level fusion multi-target tracking method, a multi-level fusion multi-target tracking system and a computer readable storage medium.
In a first aspect, the present application provides a multi-level fusion multi-target tracking method, which adopts the following technical scheme:
a multi-level fusion multi-target tracking method comprises the steps of extracting target re-identification characteristics by combining a zero-order segmentation network, processing a target relationship graph neural network and fusing multi-level long-range tracks;
the method for extracting the target re-identification characteristic by the combined zero-order segmentation network is called a sub-method I, and comprises the following steps:
dividing and preprocessing a human body target based on a dividing model SAM of a transducer; the method comprises the steps of,
inputting the target picture frame after the segmentation pretreatment into a pre-training re-recognition network, and extracting re-recognition characteristics of a human body target;
the target relation graph neural network processing is called a sub-method II, which comprises the following steps: giving a continuous video segment and a corresponding detection frame set, constructing a target relation graph model by using GNN, and generating a target track;
the multi-level long-range trajectory fusion is referred to as sub-method three, which includes: training the target relation graph models in the second sub-methods according to different levels of sequences, and linking video clips with different sizes by using the target relation graph models to finish track merging from short sequence to long sequence so as to obtain a target track of a complete video.
Optionally, the Transformer-based segmentation model SAM performing segmentation preprocessing on human body targets comprises: given a sequence of video frames:
selecting the adaptive detector YOLOX for target detection to obtain the target boxes of all human bodies in each video frame;
inputting each video frame and the corresponding set of target boxes into the segmentation model SAM for instance segmentation, outputting the highest-scoring segmentation mask for each target, and upsampling the mask to the size of the target box;
cropping each target box out of the original video frame to obtain a picture frame for each target;
and setting, according to the corresponding segmentation mask, each channel value of the pixels outside the mask to the mean value of that channel.
Optionally, inputting the segmentation-preprocessed target picture frames into the pre-trained re-identification network and extracting re-identification features of the human body targets comprises: resizing the preprocessed target picture frames to a uniform size, inputting them into a pre-trained ResNet-50-based re-identification network, and extracting for each target an appearance feature vector whose dimension is a preset parameter.
Optionally, sub-method two comprises: initializing the node features of the target relationship graph model to the appearance feature vectors output by sub-method one, and initializing the edge features to the temporal and spatial relative positions between node pairs together with the cosine distance between their appearance features.
Optionally, sub-method two further comprises:
mapping the node features and the edge features to the same feature space with two different MLP layers, respectively;
the features contained in the nodes and edges propagate through the whole graph by neural message passing, where the message passing comprises:
node-to-edge message passing: concatenating the features of node u, node v, and the connecting edge (u, v) between them, and obtaining the updated features of edge (u, v) through one MLP layer;
edge-to-node message passing: for a node u, first concatenating the features of node u with the features of an edge (u, v), obtaining the updated temporary edge (u, v) features through one MLP layer, and then averaging the features of all temporary edges connected to u to obtain the updated features of node u;
repeating the message passing process N times, so that every node can aggregate the features of the neighbor nodes and edges within distance N;
and performing binary classification on the resulting edge features to complete trajectory prediction and generate target trajectories.
Optionally, sub-method three comprises:
S11, dividing the whole video into several small clips, constructing a target relationship graph for each clip according to sub-method two, and performing message passing to obtain the node and edge features of the current level;
S12, aggregating the node and edge features belonging to the same trajectory by average pooling, and entering the next level after M training iterations;
S13, at the next level, taking the node and edge features output by several sub-clips of the previous level as input, again constructing a target relationship graph for the new level as in sub-method two, and performing message passing to obtain the node and edge features of the new level, thereby completing track merging from short sequences to long sequences;
S14, repeating steps S12 and S13 several times to obtain the target trajectories of the complete video.
In a second aspect, the application provides a multi-level fusion multi-target tracking system, which adopts the following technical scheme:
a multi-level fusion multi-target tracking system comprising a memory and a processor, the memory having stored thereon a computer program capable of being loaded by the processor and performing any of the multi-level fusion multi-target tracking methods described above.
In a third aspect, the present application provides a computer readable storage medium, which adopts the following technical scheme:
a computer readable storage medium storing a computer program capable of being loaded by a processor and executing any one of the multi-level fusion multi-target tracking methods described above.
In summary, the present application includes at least one of the following beneficial technical effects: SAM is used to segment the human body targets, so that human body target features free of background noise are obtained and the accuracy of human body target identification during tracking is improved;
the hierarchical track fusion strategy reduces the memory overhead required to model the whole video as one graph network and improves the execution speed of the model; moreover, a relationship graph network model with shared parameters is used to link target trajectories at different levels, realizing target tracking trajectories from short sequences to long sequences, which suits multi-target tracking under long-term association and can improve the accuracy of multi-target tracking.
Drawings
FIG. 1 is a schematic diagram of the architecture of the present application.
Detailed Description
The present application will be described in further detail with reference to FIG. 1.
The embodiment of the application discloses a multi-level fusion multi-target tracking method.
Referring to FIG. 1, the multi-level fusion multi-target tracking method includes: extracting target re-identification features with a joint zero-shot segmentation network, target relationship graph neural network processing, and multi-level long-range track fusion.
Extracting target re-identification features with the joint zero-shot segmentation network includes performing segmentation preprocessing on human body targets with the Transformer-based segmentation model SAM. Specifically, given a sequence of video frames:
selecting the adaptive detector YOLOX for target detection to obtain the target boxes (x, y, w, h) of all human bodies in each video frame, where (x, y) is the center point of the target box, w is its width, and h is its height;
inputting each video frame and the corresponding set of target boxes into the segmentation model SAM for instance segmentation, outputting the highest-scoring segmentation mask for each target, and upsampling the mask to the size of the target box, i.e., w × h × 3;
then, cropping each target box out of the original video frame to obtain a picture frame for each target;
and, according to the corresponding segmentation mask, setting each channel value of the pixels outside the mask to the mean value of that channel.
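To make the preprocessing concrete, the following is a minimal sketch of this background-suppression step in Python/NumPy. It assumes the target box and a boolean segmentation mask (e.g., the highest-scoring mask returned by SAM, already upsampled to the box size) are given; the function name and the choice of whole-crop channel means are illustrative assumptions, not the patent's exact implementation.

```python
import numpy as np

def suppress_background(frame: np.ndarray, box: tuple, mask: np.ndarray) -> np.ndarray:
    """Crop one target from a video frame and overwrite pixels outside the
    segmentation mask with that channel's mean value (hypothetical helper).

    frame: H x W x 3 uint8 image; box: (x, y, w, h) with (x, y) the box
    center; mask: boolean array of the box size (h x w), True on the person.
    """
    x, y, w, h = box
    # Convert the center-based box to top-left / bottom-right pixel coords.
    x0, y0 = int(x - w / 2), int(y - h / 2)
    x1, y1 = x0 + int(w), y0 + int(h)
    crop = frame[y0:y1, x0:x1].astype(np.float32).copy()

    # Replace background pixels channel by channel, so the re-identification
    # network sees no background texture, only the segmented person.
    for c in range(3):
        channel = crop[..., c]           # view into crop
        channel[~mask] = channel.mean()  # assumption: mean over the whole crop
    return crop.astype(np.uint8)
```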
This arrangement aims to reduce the influence of background noise on the appearance representation of the target. The segmentation-preprocessed target picture frames are then input into a pre-trained re-identification network to extract the re-identification features of the human body targets.
Specifically: the preprocessed target picture frames are resized to a uniform size, e.g., 256×256×3, and input into a pre-trained ResNet-50-based re-identification network, which extracts for each target an appearance feature vector whose dimension is a preset parameter (e.g., 2048×1).
With this arrangement, SAM can be used to segment the human body targets, so that human body target features free of background noise are obtained and the accuracy of human body target identification during tracking is improved.
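A minimal sketch of this feature-extraction step follows, using a torchvision ResNet-50 as a stand-in for the pre-trained re-identification network (the patent does not specify the re-identification training data or weights, so the backbone choice here is an assumption):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Stand-in backbone: ResNet-50 with the classification head removed, so the
# globally pooled 2048-d vector becomes the network output.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((256, 256)),  # uniform input size, as in the text
    T.ToTensor(),
])

@torch.no_grad()
def extract_reid_feature(target_crop) -> torch.Tensor:
    """Map one preprocessed target crop (H x W x 3 uint8 array) to a 2048-d
    appearance feature vector, L2-normalized for cosine comparison."""
    x = preprocess(target_crop).unsqueeze(0)  # 1 x 3 x 256 x 256
    feat = backbone(x).squeeze(0)             # 2048
    return feat / feat.norm()
```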
Sub-method two, target relationship graph neural network processing, includes: given a continuous video segment and a corresponding set of detection boxes, constructing a target relationship graph model with a GNN (graph neural network) and generating target trajectories. Specifically:
the node features of the target relationship graph model are initialized to the appearance feature vectors output by sub-method one, and the edge features are initialized to the temporal and spatial relative positions between node pairs together with the cosine distance between their appearance features. For example:
specifically, each edge feature is initialized to the concatenation of several features, (t_v − t_u, x, y, w, h, similarity(f(u), f(v))), where t_v − t_u is the time interval between nodes u and v; x, y, w, h locate the spatial position of the target; f(·) denotes the pre-trained ResNet-50 network; and similarity(·, ·) is the appearance similarity between the two nodes, computed as the cosine similarity: similarity(f(u), f(v)) = f(u)·f(v) / (‖f(u)‖ ‖f(v)‖).
After the initialization of the node and edge features is completed, they are mapped to the same feature space with two different MLP layers, respectively, ensuring a feature dimension of 256.
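As an illustrative sketch of this initialization (PyTorch), assuming nodes are detections, edges connect detections across frames, and the spatial part of the edge feature is the box difference between the two endpoints — the patent leaves the exact edge-construction rule and position encoding open:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphInit(nn.Module):
    """Project raw node/edge features into a shared 256-d space (a sketch)."""

    def __init__(self, node_in=2048, edge_in=6, hidden=256):
        super().__init__()
        self.node_mlp = nn.Sequential(nn.Linear(node_in, hidden), nn.ReLU())
        self.edge_mlp = nn.Sequential(nn.Linear(edge_in, hidden), nn.ReLU())

    def forward(self, app_feats, times, boxes, edges):
        # app_feats: V x 2048 re-ID vectors; times: V frame indices;
        # boxes: V x 4 boxes (x, y, w, h); edges: E x 2 index pairs (u, v).
        u, v = edges[:, 0], edges[:, 1]
        cos = F.cosine_similarity(app_feats[u], app_feats[v], dim=1)
        raw_edge = torch.cat([
            (times[v] - times[u]).float().unsqueeze(1),  # t_v - t_u
            boxes[v] - boxes[u],      # assumed relative spatial position
            cos.unsqueeze(1),         # appearance similarity term
        ], dim=1)
        return self.node_mlp(app_feats), self.edge_mlp(raw_edge)
```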
The features contained in the nodes and edges then propagate through the whole graph by neural message passing, where the message passing comprises:
node-to-edge message passing: concatenating the features of node u, node v, and the connecting edge (u, v) between them, and obtaining the updated features of edge (u, v) through one MLP layer;
edge-to-node message passing: for a node u, first concatenating the features of node u with the features of an edge (u, v), obtaining the updated temporary edge (u, v) features through one MLP layer, and then averaging the features of all temporary edges connected to u to obtain the updated features of node u;
repeating the message passing process N times (e.g., N = 12), so that every node can aggregate the features of the neighbor nodes and edges within distance N;
finally, binary classification is performed on the resulting edge features to complete trajectory prediction and generate target trajectories.
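The following PyTorch sketch implements one round of the node-to-edge and edge-to-node updates just described; the hidden size of 256, the ReLU activations, and the sigmoid edge classifier are assumptions consistent with the text but not mandated by it:

```python
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    """One round of node-to-edge and edge-to-node updates (repeat N times)."""

    def __init__(self, dim=256):
        super().__init__()
        self.edge_update = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
        self.node_update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.edge_classifier = nn.Linear(dim, 1)  # binary link / no-link

    def forward(self, h_node, h_edge, edges):
        u, v = edges[:, 0], edges[:, 1]
        # Node-to-edge: concat u's, v's, and the edge's features, one MLP.
        h_edge = self.edge_update(torch.cat([h_node[u], h_node[v], h_edge], 1))
        # Edge-to-node: temporary per-edge messages, averaged at each endpoint.
        msg_u = self.node_update(torch.cat([h_node[u], h_edge], 1))
        msg_v = self.node_update(torch.cat([h_node[v], h_edge], 1))
        agg = torch.zeros_like(h_node)
        cnt = torch.zeros(h_node.size(0), 1)
        ones = torch.ones(len(u), 1)
        agg.index_add_(0, u, msg_u); cnt.index_add_(0, u, ones)
        agg.index_add_(0, v, msg_v); cnt.index_add_(0, v, ones)
        return agg / cnt.clamp(min=1), h_edge

    def classify_edges(self, h_edge):
        # Sigmoid scores; thresholding links detections into trajectories.
        return torch.sigmoid(self.edge_classifier(h_edge)).squeeze(1)
```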
Sub-method three, multi-level long-range track fusion, is mainly implemented recursively: the relationship graph network model of sub-method two (with shared parameters) is trained on sequences of different levels, and the model is used to link video clips of different sizes, realizing track merging from short sequences to long sequences with corresponding clip lengths of [2, 4, 8, 16, 32, 64, 128, 256].
Specifically:
s11, dividing the whole video into a plurality of small fragments, constructing a target relation diagram for each fragment according to the second sub-method, and performing message transmission to obtain node and edge characteristics of the current level;
s12, aggregating the node and edge characteristics of the same track in an average pooling mode, and entering the next level after M training iterations (such as 200);
s13, in the next level, taking the node and edge characteristics output by a plurality of sub-segments of the previous level as input, continuously constructing a target relation diagram of the new level as in the second sub-method, and performing message transmission to obtain the node and edge characteristics of the new level, so as to finish track merging from short order to long order;
s14, repeating the steps S12 and S13 for a plurality of times (for example, 8 times), completing training of the 8 relationship graph network models, obtaining a target track of the complete video, and completing fusion of the multi-level tracks.
On the one hand, the hierarchical track fusion strategy reduces the memory overhead that would otherwise be required to model the whole video as a single complete graph network;
on the other hand, fusing short-term tracks into long-term tracks by sub-graph fusion improves the execution speed of the model while preserving its generalization capability.
It should be noted that:
1) Existing methods generalize poorly. Because different techniques are used for different time spans, strong assumptions must be made about the cues required at each time scale, which greatly limits the applicability of these strategies. For example, in tracking scenes where people wear uniforms and the frame rate is high, such as dance videos, local trackers based on distance or motion tend to be more reliable than trackers based on visual information. However, when the camera moves vigorously or the frame rate is low, the performance of local trackers may degrade significantly, and visual cues may become the most reliable cues. These differences inevitably lead to the need to customize a specific solution for each scenario.
2) Existing methods cannot process long videos. As the time span between the detections to be associated grows, the association becomes more ambiguous due to significant visual changes and large displacements. Thus, short-term tracking methods that use manually designed visual and dynamic cues cannot cope with arbitrarily long time spans. Although graph-based approaches are more robust, large-time-span association requires building extremely large graphs, which is impractical in terms of both computation and memory usage; the present method alleviates these problems.
The embodiment of the application also discloses a multi-level fusion multi-target tracking system.
A multi-level fusion multi-target tracking system comprising a memory and a processor, the memory having stored thereon a computer program capable of being loaded by the processor and performing a multi-level fusion multi-target tracking method as described above.
The embodiment of the application also discloses a computer readable storage medium.
A computer readable storage medium storing a computer program capable of being loaded by a processor and executing a multi-level fusion multi-target tracking method as described above.
The above embodiments are not intended to limit the scope of the present application; therefore, all equivalent changes in structure, shape, and principle of the application shall be covered by the scope of protection of the application.

Claims (8)

1. A multi-level fusion multi-target tracking method, characterized by: extracting target re-identification features with a joint zero-shot segmentation network, target relationship graph neural network processing, and multi-level long-range track fusion;
extracting target re-identification features with the joint zero-shot segmentation network is referred to as sub-method one, which comprises the following steps:
performing segmentation preprocessing on human body targets with the Transformer-based segmentation model SAM; and,
inputting the segmentation-preprocessed target picture frames into a pre-trained re-identification network, and extracting re-identification features of the human body targets;
the target relationship graph neural network processing is referred to as sub-method two, which comprises: given a continuous video segment and a corresponding set of detection boxes, constructing a target relationship graph model with a GNN and generating target trajectories;
the multi-level long-range track fusion is referred to as sub-method three, which comprises: training the target relationship graph models of sub-method two on sequences of different levels, and linking video clips of different sizes with the target relationship graph models to complete track merging from short sequences to long sequences and obtain the target trajectories of the complete video.
2. The multi-level fusion multi-target tracking method of claim 1, wherein the Transformer-based segmentation model SAM performing segmentation preprocessing on human body targets comprises: given a sequence of video frames:
selecting the adaptive detector YOLOX for target detection to obtain the target boxes of all human bodies in each video frame;
inputting each video frame and the corresponding set of target boxes into the segmentation model SAM for instance segmentation, outputting the highest-scoring segmentation mask for each target, and upsampling the mask to the size of the target box;
cropping each target box out of the original video frame to obtain a picture frame for each target;
and setting, according to the corresponding segmentation mask, each channel value of the pixels outside the mask to the mean value of that channel.
3. The multi-level fusion multi-target tracking method of claim 2, wherein inputting the segmentation-preprocessed target picture frames into the pre-trained re-identification network and extracting re-identification features of the human body targets comprises:
resizing the preprocessed target picture frames to a uniform size, inputting them into a pre-trained ResNet-50-based re-identification network, and extracting for each target an appearance feature vector whose dimension is a preset parameter.
4. The multi-level fusion multi-target tracking method of claim 3, wherein sub-method two comprises: initializing the node features of the target relationship graph model to the appearance feature vectors output by sub-method one, and initializing the edge features to the temporal and spatial relative positions between node pairs together with the cosine distance between their appearance features.
5. The multi-level fusion multi-target tracking method of claim 4, wherein sub-method two further comprises:
mapping the node features and the edge features to the same feature space with two different MLP layers, respectively;
the features contained in the nodes and edges propagate through the whole graph by neural message passing, where the message passing comprises:
node-to-edge message passing: concatenating the features of node u, node v, and the connecting edge (u, v) between them, and obtaining the updated features of edge (u, v) through one MLP layer;
edge-to-node message passing: for a node u, first concatenating the features of node u with the features of an edge (u, v), obtaining the updated temporary edge (u, v) features through one MLP layer, and then averaging the features of all temporary edges connected to u to obtain the updated features of node u;
repeating the message passing process N times, so that every node can aggregate the features of the neighbor nodes and edges within distance N, wherein N is a natural number;
and performing binary classification on the resulting edge features to complete trajectory prediction and generate target trajectories.
6. The multi-level fusion multi-target tracking method of claim 5, wherein sub-method three comprises:
S11, dividing the whole video into several small clips, constructing a target relationship graph for each clip according to sub-method two, and performing message passing to obtain the node and edge features of the current level;
S12, aggregating the node and edge features belonging to the same trajectory by average pooling, and entering the next level after M training iterations, wherein M is a natural number;
S13, at the next level, taking the node and edge features output by several sub-clips of the previous level as input, again constructing a target relationship graph for the new level as in sub-method two, and performing message passing to obtain the node and edge features of the new level, thereby completing track merging from short sequences to long sequences;
S14, repeating steps S12 and S13 several times to obtain the target trajectories of the complete video.
7. A multi-level fusion multi-target tracking system, characterized by: comprising a memory and a processor, the memory having stored thereon a computer program capable of being loaded by the processor to perform the multi-level fusion multi-target tracking method of any one of claims 1 to 6.
8. A computer readable storage medium storing a computer program capable of being loaded by a processor to execute the multi-level fusion multi-target tracking method of any one of claims 1 to 6.
CN202311018804.XA 2023-08-11 2023-08-11 Multi-level fusion multi-target tracking method, system and computer readable storage medium Pending CN117173607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311018804.XA CN117173607A (en) 2023-08-11 2023-08-11 Multi-level fusion multi-target tracking method, system and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311018804.XA CN117173607A (en) 2023-08-11 2023-08-11 Multi-level fusion multi-target tracking method, system and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN117173607A true CN117173607A (en) 2023-12-05

Family

ID=88934676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311018804.XA Pending CN117173607A (en) 2023-08-11 2023-08-11 Multi-level fusion multi-target tracking method, system and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN117173607A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726821A (en) * 2024-02-05 2024-03-19 武汉理工大学 Medical behavior identification method for region shielding in medical video
CN117726821B (en) * 2024-02-05 2024-05-10 武汉理工大学 Medical behavior identification method for region shielding in medical video

Similar Documents

Publication Publication Date Title
US11854240B2 (en) Vision based target tracking that distinguishes facial feature targets
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
Girdhar et al. Detect-and-track: Efficient pose estimation in videos
Von Stumberg et al. Gn-net: The gauss-newton loss for multi-weather relocalization
CN113920170B (en) Pedestrian track prediction method, system and storage medium combining scene context and pedestrian social relationship
CN110909591A (en) Self-adaptive non-maximum value inhibition processing method for pedestrian image detection by using coding vector
CN111914878A (en) Feature point tracking training and tracking method and device, electronic equipment and storage medium
KR20200023221A (en) Method and system for real-time target tracking based on deep learning
CN113628244A (en) Target tracking method, system, terminal and medium based on label-free video training
Pavel et al. Recurrent convolutional neural networks for object-class segmentation of RGB-D video
Guclu et al. Integrating global and local image features for enhanced loop closure detection in RGB-D SLAM systems
CN117173607A (en) Multi-level fusion multi-target tracking method, system and computer readable storage medium
Jiao et al. Magicvo: End-to-end monocular visual odometry through deep bi-directional recurrent convolutional neural network
KR20200010971A (en) Apparatus and method for detecting moving object using optical flow prediction
CN114612545A (en) Image analysis method and training method, device, equipment and medium of related model
CN117218378A (en) High-precision regression infrared small target tracking method
CN116245913A (en) Multi-target tracking method based on hierarchical context guidance
CN116309707A (en) Multi-target tracking algorithm based on self-calibration and heterogeneous network
Mumuni et al. Robust appearance modeling for object detection and tracking: a survey of deep learning approaches
Han et al. Multi-target tracking based on high-order appearance feature fusion
Mewada et al. A fast region-based active contour for non-rigid object tracking and its shape retrieval
Chang et al. Fast Online Upper Body Pose Estimation from Video.
Liu et al. Accumulated micro-motion representations for lightweight online action detection in real-time
Li et al. Spatial-temporal graph Transformer for object tracking against noise spoofing interference
CN108346158B (en) Multi-target tracking method and system based on main block data association

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination