CN117173607A - Multi-level fusion multi-target tracking method, system and computer readable storage medium - Google Patents

Multi-level fusion multi-target tracking method, system and computer readable storage medium

Info

Publication number
CN117173607A
CN117173607A (application number CN202311018804.XA)
Authority
CN
China
Prior art keywords
target
node
sub
level
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311018804.XA
Other languages
Chinese (zh)
Inventor
刘宪钦
张逸君
张伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Jiamaotong Technology Co ltd
Guangdong Branch Of National Customs Information Center
Original Assignee
Guangdong Jiamaotong Technology Co ltd
Guangdong Branch Of National Customs Information Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Jiamaotong Technology Co ltd, Guangdong Branch Of National Customs Information Center filed Critical Guangdong Jiamaotong Technology Co ltd
Priority to CN202311018804.XA priority Critical patent/CN117173607A/en
Publication of CN117173607A publication Critical patent/CN117173607A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a multi-level fusion multi-target tracking method, system and computer readable storage medium, comprising the following steps: extracting target re-identification features with a joint zero-shot segmentation network, target relationship graph neural network processing, and multi-level long-range track fusion. Sub-method one comprises: performing segmentation preprocessing on human body targets with the Transformer-based segmentation model SAM; and inputting the segmentation-preprocessed target picture frames into a pre-trained re-identification network to extract re-identification features of the human body targets. Sub-method two comprises: given a continuous video segment and a corresponding set of detection boxes, constructing a target relationship graph model with a GNN. Sub-method three comprises: training the target relationship graph models of sub-method two on sequences of different levels, and linking video clips of different sizes with the target relationship graph models. The method has the effect of improving the feature recognition accuracy of multi-target tracking.

Description

Multi-level fusion multi-target tracking method, system and computer readable storage medium
Technical Field
The application relates to the technical field of machine vision, and in particular to a multi-level fusion multi-target tracking method, system and computer readable storage medium.
Background
Multi-object tracking (Multiple Object Tracking) is one of the fundamental tasks of computer vision, whose goal is to track multiple objects over time in a video sequence. The core goal of multi-target tracking is to handle various challenges such as occlusion, motion blur, and illumination changes while still maintaining the identity of each object across successive frames.
Most multi-target tracking algorithms follow the TBD (Tracking-by-Detection) paradigm: targets are first detected in each frame; the frame-by-frame detections are then associated to form trajectories.
When high-accuracy object detection is available, data association mainly occurs between detections that are close in time, so-called short-term association. In general, simple cues such as position, motion-related proximity, or local appearance are sufficient to ensure accurate association.
However, in crowded scenes we face different challenges: for example, objects may often be occluded and go undetected for several frames. This requires establishing associations between detections separated by long time spans, i.e., long-term associations.
Given the differences in the nature of these tasks, solutions commonly used for short-term association tend to struggle in long-term association scenarios.
For the challenge of long-term association, the main problem of existing tracking methods based on visual information is that the extracted appearance features of targets have weak representational capability, making it difficult to accurately match the same target and distinguish different targets. For example: methods based on convolutional neural networks (CNNs) rely on the receptive field provided by the convolution kernel, so global features are difficult to extract, causing a certain loss of information; Transformer-based methods focus more on high-level semantic information but lack the capability to model low-level information, and struggle with the translation, rotation, and distortion caused by target motion. In addition, some methods rely on an independent re-identification mechanism, feeding detection boxes into a pre-trained re-identification model to extract target appearance features; the problem is that the detection box carries background information that introduces noise and reduces the discriminability of the target appearance features to a certain extent.
In view of the above, the present application proposes a new technical solution.
Disclosure of Invention
In order to improve the feature recognition accuracy of multi-target tracking, the application provides a multi-level fusion multi-target tracking method, a multi-level fusion multi-target tracking system and a computer readable storage medium.
In a first aspect, the present application provides a multi-level fusion multi-target tracking method, which adopts the following technical scheme:
a multi-level fusion multi-target tracking method comprises the steps of extracting target re-identification characteristics by combining a zero-order segmentation network, processing a target relationship graph neural network and fusing multi-level long-range tracks;
the method for extracting the target re-identification characteristic by the combined zero-order segmentation network is called a sub-method I, and comprises the following steps:
dividing and preprocessing a human body target based on a dividing model SAM of a transducer; the method comprises the steps of,
inputting the target picture frame after the segmentation pretreatment into a pre-training re-recognition network, and extracting re-recognition characteristics of a human body target;
the target relation graph neural network processing is called a sub-method II, which comprises the following steps: giving a continuous video segment and a corresponding detection frame set, constructing a target relation graph model by using GNN, and generating a target track;
the multi-level long-range trajectory fusion is referred to as sub-method three, which includes: training the target relation graph models in the second sub-methods according to different levels of sequences, and linking video clips with different sizes by using the target relation graph models to finish track merging from short sequence to long sequence so as to obtain a target track of a complete video.
Optionally, the Transformer-based segmentation model SAM performing segmentation preprocessing on human body targets comprises: given a sequence of video frames:
selecting the adaptive detector YOLOX for target detection to obtain the target boxes of all human bodies in each video frame;
inputting each video frame and the corresponding set of target boxes into the segmentation model SAM for instance segmentation, outputting the highest-scoring segmentation mask for each target, and upsampling the mask to the size of the target box;
cropping each target box out of the original video frame to obtain a picture frame for each target;
and setting, according to the corresponding segmentation mask, each channel value of the pixels outside the mask to the mean value of that channel.
Optionally, inputting the segmentation-preprocessed target picture frames into the pre-trained re-identification network and extracting re-identification features of the human body targets comprises: resizing the preprocessed target picture frames to a uniform size, inputting them into a pre-trained ResNet-50-based re-identification network, and extracting for each target an appearance feature vector whose dimension is a preset parameter.
Optionally, sub-method two comprises: initializing the node features of the target relationship graph model to the appearance feature vectors output by sub-method one, and initializing the edge features to the temporal and spatial relative positions between node pairs together with the cosine distance between their appearance features.
Optionally, sub-method two further comprises:
mapping the node features and the edge features to the same feature space with two different MLP layers, respectively;
the features contained in the nodes and edges propagate through the whole graph by neural message passing, where the message passing comprises:
node-to-edge message passing: concatenating the features of node u, node v, and the connecting edge (u, v) between them, and obtaining the updated features of edge (u, v) through one MLP layer;
edge-to-node message passing: for a node u, first concatenating the features of node u with the features of an edge (u, v), obtaining the updated temporary edge (u, v) features through one MLP layer, and then averaging the features of all temporary edges connected to u to obtain the updated features of node u;
repeating the message passing process N times, so that every node can aggregate the features of the neighbor nodes and edges within distance N;
and performing binary classification on the resulting edge features to complete trajectory prediction and generate target trajectories.
Optionally, sub-method three comprises:
S11, dividing the whole video into several small clips, constructing a target relationship graph for each clip according to sub-method two, and performing message passing to obtain the node and edge features of the current level;
S12, aggregating the node and edge features belonging to the same trajectory by average pooling, and entering the next level after M training iterations;
S13, at the next level, taking the node and edge features output by several sub-clips of the previous level as input, again constructing a target relationship graph for the new level as in sub-method two, and performing message passing to obtain the node and edge features of the new level, thereby completing track merging from short sequences to long sequences;
S14, repeating steps S12 and S13 several times to obtain the target trajectories of the complete video.
In a second aspect, the application provides a multi-level fusion multi-target tracking system, which adopts the following technical scheme:
a multi-level fusion multi-target tracking system comprising a memory and a processor, the memory having stored thereon a computer program capable of being loaded by the processor and performing any of the multi-level fusion multi-target tracking methods described above.
In a third aspect, the present application provides a computer readable storage medium, which adopts the following technical scheme:
a computer readable storage medium storing a computer program capable of being loaded by a processor and executing any one of the multi-level fusion multi-target tracking methods described above.
In summary, the present application includes at least one of the following beneficial technical effects: SAM is used to segment the human body targets, so that human body target features free of background noise are obtained and the accuracy of human body target identification during tracking is improved;
the hierarchical track fusion strategy reduces the memory overhead required to model the whole video as one graph network and improves the execution speed of the model; moreover, a relationship graph network model with shared parameters is used to link target trajectories at different levels, realizing target tracking trajectories from short sequences to long sequences, which suits multi-target tracking under long-term association and can improve the accuracy of multi-target tracking.
Drawings
FIG. 1 is a schematic diagram of the architecture of the present application.
Detailed Description
The present application will be described in further detail with reference to FIG. 1.
The embodiment of the application discloses a multi-level fusion multi-target tracking method.
Referring to FIG. 1, the multi-level fusion multi-target tracking method includes: extracting target re-identification features with a joint zero-shot segmentation network, target relationship graph neural network processing, and multi-level long-range track fusion.
Extracting target re-identification features with the joint zero-shot segmentation network includes performing segmentation preprocessing on human body targets with the Transformer-based segmentation model SAM. Specifically, given a sequence of video frames:
selecting the adaptive detector YOLOX for target detection to obtain the target boxes (x, y, w, h) of all human bodies in each video frame, where (x, y) is the center point of the target box, w is its width, and h is its height;
inputting each video frame and the corresponding set of target boxes into the segmentation model SAM for instance segmentation, outputting the highest-scoring segmentation mask for each target, and upsampling the mask to the size of the target box, i.e., w × h × 3;
then, cropping each target box out of the original video frame to obtain a picture frame for each target;
and, according to the corresponding segmentation mask, setting each channel value of the pixels outside the mask to the mean value of that channel.
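To make the preprocessing concrete, the following is a minimal sketch of this background-suppression step in Python/NumPy. It assumes the target box and a boolean segmentation mask (e.g., the highest-scoring mask returned by SAM, already upsampled to the box size) are given; the function name and the choice of whole-crop channel means are illustrative assumptions, not the patent's exact implementation.

```python
import numpy as np

def suppress_background(frame: np.ndarray, box: tuple, mask: np.ndarray) -> np.ndarray:
    """Crop one target from a video frame and overwrite pixels outside the
    segmentation mask with that channel's mean value (hypothetical helper).

    frame: H x W x 3 uint8 image; box: (x, y, w, h) with (x, y) the box
    center; mask: boolean array of the box size (h x w), True on the person.
    """
    x, y, w, h = box
    # Convert the center-based box to top-left / bottom-right pixel coords.
    x0, y0 = int(x - w / 2), int(y - h / 2)
    x1, y1 = x0 + int(w), y0 + int(h)
    crop = frame[y0:y1, x0:x1].astype(np.float32).copy()

    # Replace background pixels channel by channel, so the re-identification
    # network sees no background texture, only the segmented person.
    for c in range(3):
        channel = crop[..., c]           # view into crop
        channel[~mask] = channel.mean()  # assumption: mean over the whole crop
    return crop.astype(np.uint8)
```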
This arrangement aims to reduce the influence of background noise on the appearance representation of the target. The segmentation-preprocessed target picture frames are then input into a pre-trained re-identification network to extract the re-identification features of the human body targets.
Specifically: the preprocessed target picture frames are resized to a uniform size, e.g., 256×256×3, and input into a pre-trained ResNet-50-based re-identification network, which extracts for each target an appearance feature vector whose dimension is a preset parameter (e.g., 2048×1).
With this arrangement, SAM can be used to segment the human body targets, so that human body target features free of background noise are obtained and the accuracy of human body target identification during tracking is improved.
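A minimal sketch of this feature-extraction step follows, using a torchvision ResNet-50 as a stand-in for the pre-trained re-identification network (the patent does not specify the re-identification training data or weights, so the backbone choice here is an assumption):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Stand-in backbone: ResNet-50 with the classification head removed, so the
# globally pooled 2048-d vector becomes the network output.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((256, 256)),  # uniform input size, as in the text
    T.ToTensor(),
])

@torch.no_grad()
def extract_reid_feature(target_crop) -> torch.Tensor:
    """Map one preprocessed target crop (H x W x 3 uint8 array) to a 2048-d
    appearance feature vector, L2-normalized for cosine comparison."""
    x = preprocess(target_crop).unsqueeze(0)  # 1 x 3 x 256 x 256
    feat = backbone(x).squeeze(0)             # 2048
    return feat / feat.norm()
```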
Sub-method two, target relationship graph neural network processing, includes: given a continuous video segment and a corresponding set of detection boxes, constructing a target relationship graph model with a GNN (graph neural network) and generating target trajectories. Specifically:
the node features of the target relationship graph model are initialized to the appearance feature vectors output by sub-method one, and the edge features are initialized to the temporal and spatial relative positions between node pairs together with the cosine distance between their appearance features. For example:
specifically, each edge feature is initialized to the concatenation of several features, (t_v − t_u, x, y, w, h, similarity(f(u), f(v))), where t_v − t_u is the time interval between nodes u and v; x, y, w, h locate the spatial position of the target; f(·) denotes the pre-trained ResNet-50 network; and similarity(·, ·) is the appearance similarity between the two nodes, computed as the cosine similarity: similarity(f(u), f(v)) = f(u)·f(v) / (‖f(u)‖ ‖f(v)‖).
After the initialization of the node and edge features is completed, they are mapped to the same feature space with two different MLP layers, respectively, ensuring a feature dimension of 256.
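As an illustrative sketch of this initialization (PyTorch), assuming nodes are detections, edges connect detections across frames, and the spatial part of the edge feature is the box difference between the two endpoints — the patent leaves the exact edge-construction rule and position encoding open:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphInit(nn.Module):
    """Project raw node/edge features into a shared 256-d space (a sketch)."""

    def __init__(self, node_in=2048, edge_in=6, hidden=256):
        super().__init__()
        self.node_mlp = nn.Sequential(nn.Linear(node_in, hidden), nn.ReLU())
        self.edge_mlp = nn.Sequential(nn.Linear(edge_in, hidden), nn.ReLU())

    def forward(self, app_feats, times, boxes, edges):
        # app_feats: V x 2048 re-ID vectors; times: V frame indices;
        # boxes: V x 4 boxes (x, y, w, h); edges: E x 2 index pairs (u, v).
        u, v = edges[:, 0], edges[:, 1]
        cos = F.cosine_similarity(app_feats[u], app_feats[v], dim=1)
        raw_edge = torch.cat([
            (times[v] - times[u]).float().unsqueeze(1),  # t_v - t_u
            boxes[v] - boxes[u],      # assumed relative spatial position
            cos.unsqueeze(1),         # appearance similarity term
        ], dim=1)
        return self.node_mlp(app_feats), self.edge_mlp(raw_edge)
```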
The features contained in the nodes and edges then propagate through the whole graph by neural message passing, where the message passing comprises:
node-to-edge message passing: concatenating the features of node u, node v, and the connecting edge (u, v) between them, and obtaining the updated features of edge (u, v) through one MLP layer;
edge-to-node message passing: for a node u, first concatenating the features of node u with the features of an edge (u, v), obtaining the updated temporary edge (u, v) features through one MLP layer, and then averaging the features of all temporary edges connected to u to obtain the updated features of node u;
repeating the message passing process N times (e.g., N = 12), so that every node can aggregate the features of the neighbor nodes and edges within distance N;
finally, binary classification is performed on the resulting edge features to complete trajectory prediction and generate target trajectories.
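The following PyTorch sketch implements one round of the node-to-edge and edge-to-node updates just described; the hidden size of 256, the ReLU activations, and the sigmoid edge classifier are assumptions consistent with the text but not mandated by it:

```python
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    """One round of node-to-edge and edge-to-node updates (repeat N times)."""

    def __init__(self, dim=256):
        super().__init__()
        self.edge_update = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
        self.node_update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.edge_classifier = nn.Linear(dim, 1)  # binary link / no-link

    def forward(self, h_node, h_edge, edges):
        u, v = edges[:, 0], edges[:, 1]
        # Node-to-edge: concat u's, v's, and the edge's features, one MLP.
        h_edge = self.edge_update(torch.cat([h_node[u], h_node[v], h_edge], 1))
        # Edge-to-node: temporary per-edge messages, averaged at each endpoint.
        msg_u = self.node_update(torch.cat([h_node[u], h_edge], 1))
        msg_v = self.node_update(torch.cat([h_node[v], h_edge], 1))
        agg = torch.zeros_like(h_node)
        cnt = torch.zeros(h_node.size(0), 1)
        ones = torch.ones(len(u), 1)
        agg.index_add_(0, u, msg_u); cnt.index_add_(0, u, ones)
        agg.index_add_(0, v, msg_v); cnt.index_add_(0, v, ones)
        return agg / cnt.clamp(min=1), h_edge

    def classify_edges(self, h_edge):
        # Sigmoid scores; thresholding links detections into trajectories.
        return torch.sigmoid(self.edge_classifier(h_edge)).squeeze(1)
```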
Sub-method three, multi-level long-range track fusion, is mainly implemented recursively: the relationship graph network model of sub-method two (with shared parameters) is trained on sequences of different levels, and the model is used to link video clips of different sizes, realizing track merging from short sequences to long sequences with corresponding clip lengths of [2, 4, 8, 16, 32, 64, 128, 256].
Specifically:
s11, dividing the whole video into a plurality of small fragments, constructing a target relation diagram for each fragment according to the second sub-method, and performing message transmission to obtain node and edge characteristics of the current level;
s12, aggregating the node and edge characteristics of the same track in an average pooling mode, and entering the next level after M training iterations (such as 200);
s13, in the next level, taking the node and edge characteristics output by a plurality of sub-segments of the previous level as input, continuously constructing a target relation diagram of the new level as in the second sub-method, and performing message transmission to obtain the node and edge characteristics of the new level, so as to finish track merging from short order to long order;
s14, repeating the steps S12 and S13 for a plurality of times (for example, 8 times), completing training of the 8 relationship graph network models, obtaining a target track of the complete video, and completing fusion of the multi-level tracks.
On the one hand, the hierarchical track fusion strategy reduces the memory overhead that would otherwise be required to model the whole video as a single complete graph network;
on the other hand, fusing short-term tracks into long-term tracks by sub-graph fusion improves the execution speed of the model while preserving its generalization capability.
It should be noted that:
1) Existing methods generalize poorly. Because different techniques are used for different time spans, strong assumptions must be made about the cues required at each time scale, which greatly limits the applicability of these strategies. For example, in tracking scenes where people wear uniforms and the frame rate is high, such as dance videos, local trackers based on distance or motion tend to be more reliable than trackers based on visual information. However, when the camera moves vigorously or the frame rate is low, the performance of local trackers may degrade significantly, and visual cues may become the most reliable cues. These differences inevitably lead to the need to customize a specific solution for each scenario.
2) Existing methods cannot process long videos. As the time span between the detections to be associated grows, the association becomes more ambiguous due to significant visual changes and large displacements. Thus, short-term tracking methods that use manually designed visual and dynamic cues cannot cope with arbitrarily long time spans. Although graph-based approaches are more robust, large-time-span association requires building extremely large graphs, which is impractical in terms of both computation and memory usage; the present method alleviates these problems.
The embodiment of the application also discloses a multi-level fusion multi-target tracking system.
A multi-level fusion multi-target tracking system comprising a memory and a processor, the memory having stored thereon a computer program capable of being loaded by the processor and performing a multi-level fusion multi-target tracking method as described above.
The embodiment of the application also discloses a computer readable storage medium.
A computer readable storage medium storing a computer program capable of being loaded by a processor and executing a multi-level fusion multi-target tracking method as described above.
The above embodiments are not intended to limit the scope of the present application; therefore, all equivalent changes in structure, shape, and principle of the application shall be covered by the scope of protection of the application.

Claims (8)

1. A multi-level fusion multi-target tracking method, characterized by: extracting target re-identification features with a joint zero-shot segmentation network, target relationship graph neural network processing, and multi-level long-range track fusion;
extracting target re-identification features with the joint zero-shot segmentation network is referred to as sub-method one, which comprises the following steps:
performing segmentation preprocessing on human body targets with the Transformer-based segmentation model SAM; and,
inputting the segmentation-preprocessed target picture frames into a pre-trained re-identification network, and extracting re-identification features of the human body targets;
the target relationship graph neural network processing is referred to as sub-method two, which comprises: given a continuous video segment and a corresponding set of detection boxes, constructing a target relationship graph model with a GNN and generating target trajectories;
the multi-level long-range track fusion is referred to as sub-method three, which comprises: training the target relationship graph models of sub-method two on sequences of different levels, and linking video clips of different sizes with the target relationship graph models to complete track merging from short sequences to long sequences and obtain the target trajectories of the complete video.
2. The multi-level fusion multi-target tracking method of claim 1, wherein the Transformer-based segmentation model SAM performing segmentation preprocessing on human body targets comprises: given a sequence of video frames:
selecting the adaptive detector YOLOX for target detection to obtain the target boxes of all human bodies in each video frame;
inputting each video frame and the corresponding set of target boxes into the segmentation model SAM for instance segmentation, outputting the highest-scoring segmentation mask for each target, and upsampling the mask to the size of the target box;
cropping each target box out of the original video frame to obtain a picture frame for each target;
and setting, according to the corresponding segmentation mask, each channel value of the pixels outside the mask to the mean value of that channel.
3. The multi-level fusion multi-target tracking method of claim 2, wherein inputting the segmentation-preprocessed target picture frames into the pre-trained re-identification network and extracting re-identification features of the human body targets comprises:
resizing the preprocessed target picture frames to a uniform size, inputting them into a pre-trained ResNet-50-based re-identification network, and extracting for each target an appearance feature vector whose dimension is a preset parameter.
4. The multi-level fusion multi-target tracking method of claim 3, wherein sub-method two comprises: initializing the node features of the target relationship graph model to the appearance feature vectors output by sub-method one, and initializing the edge features to the temporal and spatial relative positions between node pairs together with the cosine distance between their appearance features.
5. The multi-level fusion multi-target tracking method of claim 4, wherein sub-method two further comprises:
mapping the node features and the edge features to the same feature space with two different MLP layers, respectively;
the features contained in the nodes and edges propagate through the whole graph by neural message passing, where the message passing comprises:
node-to-edge message passing: concatenating the features of node u, node v, and the connecting edge (u, v) between them, and obtaining the updated features of edge (u, v) through one MLP layer;
edge-to-node message passing: for a node u, first concatenating the features of node u with the features of an edge (u, v), obtaining the updated temporary edge (u, v) features through one MLP layer, and then averaging the features of all temporary edges connected to u to obtain the updated features of node u;
repeating the message passing process N times, so that every node can aggregate the features of the neighbor nodes and edges within distance N, wherein N is a natural number;
and performing binary classification on the resulting edge features to complete trajectory prediction and generate target trajectories.
6. The multi-level fusion multi-target tracking method of claim 5, wherein sub-method three comprises:
S11, dividing the whole video into several small clips, constructing a target relationship graph for each clip according to sub-method two, and performing message passing to obtain the node and edge features of the current level;
S12, aggregating the node and edge features belonging to the same trajectory by average pooling, and entering the next level after M training iterations, wherein M is a natural number;
S13, at the next level, taking the node and edge features output by several sub-clips of the previous level as input, again constructing a target relationship graph for the new level as in sub-method two, and performing message passing to obtain the node and edge features of the new level, thereby completing track merging from short sequences to long sequences;
S14, repeating steps S12 and S13 several times to obtain the target trajectories of the complete video.
7. A multi-level fusion multi-target tracking system, characterized by: comprising a memory and a processor, the memory having stored thereon a computer program capable of being loaded by the processor to perform the multi-level fusion multi-target tracking method of any one of claims 1 to 6.
8. A computer readable storage medium storing a computer program capable of being loaded by a processor to execute the multi-level fusion multi-target tracking method of any one of claims 1 to 6.
CN202311018804.XA 2023-08-11 2023-08-11 Multi-level fusion multi-target tracking method, system and computer readable storage medium Pending CN117173607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311018804.XA CN117173607A (en) 2023-08-11 2023-08-11 Multi-level fusion multi-target tracking method, system and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311018804.XA CN117173607A (en) 2023-08-11 2023-08-11 Multi-level fusion multi-target tracking method, system and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN117173607A true CN117173607A (en) 2023-12-05

Family

ID=88934676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311018804.XA Pending CN117173607A (en) 2023-08-11 2023-08-11 Multi-level fusion multi-target tracking method, system and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN117173607A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726821A (en) * 2024-02-05 2024-03-19 武汉理工大学 Medical behavior identification method for region shielding in medical video
CN117726821B (en) * 2024-02-05 2024-05-10 武汉理工大学 Medical behavior identification method for region shielding in medical video

Similar Documents

Publication Publication Date Title
US11854240B2 (en) Vision based target tracking that distinguishes facial feature targets
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
Girdhar et al. Detect-and-track: Efficient pose estimation in videos
Von Stumberg et al. Gn-net: The gauss-newton loss for multi-weather relocalization
CN113920170B (en) Pedestrian track prediction method, system and storage medium combining scene context and pedestrian social relationship
CN110909591A (en) Self-adaptive non-maximum value inhibition processing method for pedestrian image detection by using coding vector
CN111914878A (en) Feature point tracking training and tracking method and device, electronic equipment and storage medium
KR20200023221A (en) Method and system for real-time target tracking based on deep learning
CN113628244A (en) Target tracking method, system, terminal and medium based on label-free video training
Pavel et al. Recurrent convolutional neural networks for object-class segmentation of RGB-D video
Guclu et al. Integrating global and local image features for enhanced loop closure detection in RGB-D SLAM systems
CN117173607A (en) Multi-level fusion multi-target tracking method, system and computer readable storage medium
Jiao et al. Magicvo: End-to-end monocular visual odometry through deep bi-directional recurrent convolutional neural network
KR20200010971A (en) Apparatus and method for detecting moving object using optical flow prediction
CN114612545A (en) Image analysis method and training method, device, equipment and medium of related model
CN117218378A (en) High-precision regression infrared small target tracking method
CN116245913A (en) Multi-target tracking method based on hierarchical context guidance
CN116309707A (en) Multi-target tracking algorithm based on self-calibration and heterogeneous network
Mumuni et al. Robust appearance modeling for object detection and tracking: a survey of deep learning approaches
Han et al. Multi-target tracking based on high-order appearance feature fusion
Mewada et al. A fast region-based active contour for non-rigid object tracking and its shape retrieval
Chang et al. Fast Online Upper Body Pose Estimation from Video.
Liu et al. Accumulated micro-motion representations for lightweight online action detection in real-time
Li et al. Spatial-temporal graph Transformer for object tracking against noise spoofing interference
CN108346158B (en) Multi-target tracking method and system based on main block data association

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination