CN115272404B - Multi-target tracking method based on kernel space and implicit space feature alignment


Info

Publication number: CN115272404B
Authority: CN (China)
Prior art keywords: vector, semantic, feature, detection, features
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Application number: CN202210689366.9A
Other languages: Chinese (zh)
Other versions: CN115272404A (en)
Inventors: 孔军 (Kong Jun), 刘加林 (Liu Jialin), 蒋敏 (Jiang Min)
Current Assignee: Jiangnan University (the listed assignees may be inaccurate)
Original Assignee: Jiangnan University
Application filed by Jiangnan University
Priority to CN202210689366.9A
Publication of CN115272404A
Application granted
Publication of CN115272404B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10016 - Video; Image sequence
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The multi-target tracking method provided by the invention first performs global average pooling on the shared features in the channel dimension to obtain a shared semantic vector, and transforms this vector by dimension into a low-dimensional vector representing shallow semantic information and a high-dimensional vector representing deep semantic information. The shared semantic vector is then decoupled through operations such as out-of-order rearrangement and the splitting and recombination of the high- and low-dimensional vectors, yielding semantic vectors aligned across the low and high dimensions and adapted to the detection branch and the re-identification branch, respectively. Finally, the aligned semantic vectors are weighted and summed with the shared features to obtain multi-dimensionally semantically aligned detection features and re-identification features. Through the proposed multi-dimensional semantic alignment module, the two sub-branch tasks align the semantic features of the shallow and deep dimensions and select them independently, which effectively relieves competition for the focus positions of feature attention during joint optimization.

Description

Multi-target tracking method based on kernel space and implicit space feature alignment
Technical Field
The invention relates to the technical field of machine vision, and in particular to a multi-target tracking method, device, apparatus and computer storage medium based on kernel space and implicit space feature alignment.
Background
Vision is an important component of perceiving the world, and most visual perception is inseparable from the detection and tracking of objects. Multi-target tracking is therefore widely applied in visual tasks such as video surveillance, autonomous driving and unmanned aerial vehicles. Multi-target tracking mainly detects and localizes objects across consecutive key frames and finally forms complete trajectories. Early multi-target tracking methods detected and localized objects by introducing hand-crafted constraints, but as target scenes became more complex, e.g. crowded pedestrians, changes in ambient light or fast-moving objects, traditional multi-target tracking methods performed poorly. With the rapid development of deep learning, a large number of two-stage multi-target tracking methods have emerged. These methods replace the original hand-crafted constraints and let the network learn an optimal tracking model from large amounts of training data, making them applicable to multi-target tracking in various complex scenes. However, the features required by the two stages of such architectures are computed independently, which inevitably adds a large computational burden and hinders joint optimization across stages, thereby limiting tracking efficiency.
Consequently, single-stage tracking paradigms based on anchor points and anchor boxes have gradually become mainstream; both feed shared features to the sub-tasks to reduce the computational burden and improve the efficiency of joint optimization. However, this joint optimization over shared features also suffers from problems. For example, detection distinguishes pedestrians as a class and therefore focuses on features that are homogeneous across individuals while ignoring individual variability, whereas re-identification must attend to the unique, individualized features that distinguish each person. The features required by the detection branch and the re-identification branch thus differ, which creates a mismatch between the input features and the task objectives of the different tasks. This mismatch is reflected in the inconsistent distribution of the detection and re-identification branches over different dimensions of the semantic features, i.e. the misalignment phenomenon in the kernel space, and leads to lower multi-target tracking accuracy.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is the low multi-target tracking accuracy caused by kernel-space misalignment in the prior art.
In order to solve the technical problems, the present invention provides a multi-target tracking method, including:
acquiring a current video frame image and calculating sharing characteristics of the current video frame image;
carrying out global average pooling on the shared features in the channel dimension to obtain a shared semantic feature vector;
the shared semantic feature vector is transformed by dimension to obtain a low-dimensional vector representing shallow semantic information and a high-dimensional vector representing deep semantic information;
splitting and recombining the channel-shuffled low-dimensional vector and high-dimensional vector to obtain multi-dimensional semantic vectors that are adapted to the detection branch and to the re-identification branch, respectively, while carrying both shallow and deep semantic information;
processing the multi-dimensional semantic vector adapted to the detection branch and the multi-dimensional semantic vector adapted to the re-identification branch through two sets of parameter-independent fully-connected layers, batch normalization and Sigmoid activation functions, respectively, to complete the alignment of the multi-dimensional semantic vectors;
respectively weighting and summing the two aligned multi-dimensional semantic vectors with the shared features to obtain multi-dimensionally semantically aligned detection features and re-identification features;
calculating a current-frame detection box and a current-frame appearance embedding vector from the detection features and the re-identification features;
judging whether the current video frame image is the first frame, and if not, matching and associating the current-frame detection box and the current-frame appearance embedding vector with historical frame trajectories;
and continuing to process the next video frame image until the video is finished.
Preferably, the splitting and recombining of the channel-shuffled low-dimensional vector and high-dimensional vector to obtain multi-dimensional semantic vectors adapted to the detection branch and to the re-identification branch while carrying both shallow and deep semantic information includes:
resampling the channel-shuffled low-dimensional vector and high-dimensional vector through four fully-connected operations with independent parameters, to obtain a shallow semantic vector and a deep semantic vector adapted to the detection branch and a shallow semantic vector and a deep semantic vector adapted to the re-identification branch;
and combining the shallow and deep semantic vectors adapted to the detection branch and those adapted to the re-identification branch, respectively, to obtain the multi-dimensional semantic vectors adapted to the detection branch and to the re-identification branch that carry both shallow and deep semantic information.
Preferably, after the multi-dimensionally semantically aligned detection features and re-identification features are obtained, the method further includes:
axially pooling the re-identification features along the two spatial dimensions to obtain two axial features;
aggregating the two axial features along the channel dimension to obtain an aggregated feature;
inputting the aggregated feature into a spatial alignment module to complete the consistent alignment of local and global information, obtaining an aggregated alignment feature;
splitting the aggregated alignment feature into two axial alignment features along the channel dimension;
and applying linear transformations and activation functions to the two axial alignment features, using them to weight the re-identification features, and adding a residual connection with the re-identification features to obtain cross-region aligned re-identification features.
Preferably, the inputting of the aggregated feature into the spatial alignment module to complete the consistent alignment of local and global information and obtain the aggregated alignment feature includes:
shuffling the aggregated feature group-wise along the aggregation channel, and completing spatial axial feature alignment through two fully-connected and activation-function operations;
shifting the spatially axially aligned aggregated feature along the spatial axial dimension, and completing cross-region axial feature alignment through two fully-connected and activation-function operations;
and obtaining the aggregated alignment feature by shift restoration and group restoration of the cross-region axially aligned aggregated feature.
Preferably, the calculating of the current-frame detection box and the current-frame appearance embedding vector from the detection features and the re-identification features includes:
calculating a heatmap tensor, an offset branch tensor and a size branch tensor from the detection features, and from these obtaining the current-frame detection box;
and calculating an appearance embedding tensor from the re-identification features, and extracting the current-frame appearance embedding vector.
Preferably, after the current-frame detection box and the current-frame appearance embedding vector are calculated from the detection features and the re-identification features, the method further includes:
combining the heatmap tensor, the offset branch tensor and the size branch tensor to obtain a combined feature;
applying a linear transformation and an activation function to the current-frame appearance embedding vector to obtain a first projection vector and a second projection vector of the manifold space;
multiplying the combined feature with the first projection vector and applying a linear transformation and an activation function to obtain a combined feature in which the detection features and the re-identification features are aligned;
multiplying this aligned combined feature by the second projection vector to obtain a detection vector aligned with the association information;
and disassembling the detection vector aligned with the association information into a heatmap tensor, an offset branch tensor and a size branch tensor aligned with the association information, and from these obtaining a manifold-space-projection-aligned current-frame detection box.
Preferably, the matching and associating of the current-frame detection box and the current-frame appearance embedding vector with historical frame trajectories includes:
calculating a re-identification embedding affinity matrix between all targets of the current frame and the historical frame targets, and adding a motion-model constraint for trajectory association by combining Kalman filtering;
solving the optimal matching with the Hungarian algorithm, and updating the target trajectory states of the current frame;
and re-matching the unmatched targets using the IoU distance, and updating the target trajectory states of the current frame.
The invention also provides an apparatus for multi-target tracking, comprising:
a shared feature calculation module, configured to acquire a current video frame image and calculate shared features of the current video frame image;
a shared semantic feature vector calculation module, configured to perform global average pooling on the shared features in the channel dimension to obtain a shared semantic feature vector;
a dimension transformation module, configured to transform the shared semantic feature vector into a low-dimensional vector representing shallow semantic information and a high-dimensional vector representing deep semantic information;
a multi-dimensional semantic vector calculation module, configured to split and recombine the channel-shuffled low-dimensional vector and high-dimensional vector to obtain multi-dimensional semantic vectors adapted to the detection branch and to the re-identification branch while carrying both shallow and deep semantic information;
a multi-dimensional semantic vector alignment module, configured to process the multi-dimensional semantic vector adapted to the detection branch and the multi-dimensional semantic vector adapted to the re-identification branch through two sets of parameter-independent fully-connected layers, batch normalization and Sigmoid activation functions, respectively, to complete the alignment of the multi-dimensional semantic vectors;
a detection feature and re-identification feature calculation module, configured to respectively weight and sum the two aligned multi-dimensional semantic vectors with the shared features to obtain multi-dimensionally semantically aligned detection features and re-identification features;
a detection box and appearance embedding vector calculation module, configured to calculate a current-frame detection box and a current-frame appearance embedding vector from the detection features and the re-identification features;
a matching association module, configured to judge whether the current video frame image is the first frame, and if not, to match and associate the current-frame detection box and the current-frame appearance embedding vector with historical frame trajectories;
and a loop processing module, configured to continue processing the next video frame image until the video ends.
The invention also provides a multi-target tracking device, comprising:
a memory for storing a computer program; and a processor for implementing the steps of the above multi-target tracking method when executing the computer program.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the multi-target tracking method described above.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the multi-target tracking method comprises the steps of firstly, carrying out global average pooling on shared features in channel dimensions, calculating to obtain a shared semantic vector, and respectively obtaining a low-dimensional vector representing shallow semantic information and a high-dimensional vector representing deep semantic information by dimension transformation on the shared semantic vector. And then decoupling the shared semantic vector through operations such as out-of-order rearrangement, high-low dimension vector splitting and recombination and the like to obtain the high-low dimension aligned semantic vector adapting to the detection branch and the re-identification branch. And finally, carrying out weighted summation on the detection features and the shared features to obtain detection features and re-identification features with multi-dimensional semantic alignment. According to the method, on one hand, the detection branch and the re-identification branch are decoupled into two different feature inputs, and on the other hand, the inconsistency of the subtask features is relieved through the feature alignment of the shallow semantic dimension and the deep semantic dimension, so that competition of focus positions of feature attention in the joint optimization process is effectively relieved, and multi-target tracking precision is improved.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings, in which:
FIG. 1 is a flow chart of an implementation of a multi-target tracking method of the present invention;
FIG. 2 is a block diagram of a multidimensional semantic alignment module of the present invention;
FIG. 3 is a raw frame feature map;
FIG. 4 is a shared feature map;
FIG. 5 is a detection alignment feature map;
FIG. 6 is a re-identification alignment feature map;
FIG. 7 is a block diagram of a cross-domain embedded alignment module of the present invention;
FIG. 8 is a cross-region embedding pre-alignment feature map;
FIG. 9 is a cross-region embedded post-alignment feature map;
FIG. 10 is a block diagram of a manifold space projection alignment module of the present invention;
FIG. 11 is a manifold space projection pre-alignment feature map;
FIG. 12 is a feature diagram of manifold space projection alignment;
FIG. 13 is an overall framework diagram of the present invention;
FIG. 14 is a graph of tracking visualization results on the MOT20 public dataset;
fig. 15 is a block diagram of a device for a multi-target tracking method according to an embodiment of the present invention.
Detailed Description
The core of the invention is to provide a multi-target tracking method, apparatus, device and computer storage medium that improve multi-target tracking accuracy through the alignment of the kernel space.
In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1 and fig. 2, fig. 1 is a flowchart illustrating an implementation of a multi-target tracking method provided by the present invention, and fig. 2 is a structure diagram of a multi-dimensional semantic alignment module; the specific operation steps are as follows:
s101: acquiring a current video frame image and calculating sharing characteristics of the current video frame image;
consecutive video frames are read in single-frame mode; taking the t-th frame, t ∈ [1, N], as an example, its input size is 3 × h_t × w_t, where N is the total number of frames in the video sequence and h_t and w_t are the height and width of the t-th frame, respectively;
the shared feature F_b of the t-th frame is computed using a convolutional network as the backbone;
S102: global average pooling is carried out on the shared features in the channel dimension C to obtain a shared semantic feature vector z c
S103: obtaining a low-dimensional vector z representing shallow semantic information by dimension transformation of the shared semantic feature vector l And a high-dimensional vector z representing deep semantic information h
Firstly, the shared features are subjected to global average pooling operation to obtain a semantic vector z 1 ∈R 1×C . Through two sets of parameter independent fully-connected FC, the operation of the BN and ReLU activation functions is normalized, and a decoupled low-dimensional semantic vector z is obtained l ∈R 1×0.5C And a high-dimensional semantic vector z h ∈R 1×4C
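By way of illustration, the following PyTorch sketch shows one possible wiring of S102-S103; it is not the patented code, and the class name and layer hyperparameters beyond the stated 0.5C/4C widths are our assumptions.

```python
# Minimal sketch of S102-S103 (illustrative, not the patent's implementation).
import torch
import torch.nn as nn

class SemanticDecoupler(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        # Two parameter-independent FC + BN + ReLU stacks yield the
        # low-dimensional (0.5C) and high-dimensional (4C) semantic vectors.
        self.low = nn.Sequential(nn.Linear(c, c // 2), nn.BatchNorm1d(c // 2), nn.ReLU())
        self.high = nn.Sequential(nn.Linear(c, 4 * c), nn.BatchNorm1d(4 * c), nn.ReLU())

    def forward(self, f_b: torch.Tensor):
        # f_b: shared feature map of shape (B, C, H, W).
        z_c = f_b.mean(dim=(2, 3))            # global average pooling -> (B, C)
        return self.low(z_c), self.high(z_c)  # z_l: (B, 0.5C), z_h: (B, 4C)
```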
S104: the channel-shuffled low-dimensional vector and high-dimensional vector are split and recombined to obtain multi-dimensional semantic vectors that are adapted to the detection branch and to the re-identification branch, respectively, while carrying both shallow and deep semantic information;
the channel-shuffled low-dimensional vector and high-dimensional vector are resampled through four fully-connected operations with independent parameters, yielding a shallow semantic vector and a deep semantic vector adapted to the detection branch, and a shallow semantic vector and a deep semantic vector adapted to the re-identification branch;
the shallow and deep semantic vectors adapted to the detection branch and those adapted to the re-identification branch are combined, respectively, to obtain the multi-dimensional semantic vectors z_d ∈ R^{1×2.5C} and z_r ∈ R^{1×2.5C}, adapted to the detection branch and the re-identification branch and carrying both shallow and deep semantic information.
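A sketch of this shuffle-split-recombine step follows. The 0.5C/2C split sizes (chosen so they sum to the stated 2.5C), the group count of the shuffle, and the divisibility of C by 8 are assumptions, since the text does not fix them.

```python
# Assumed sketch of S104: channel shuffle, four independent FC resamplings,
# and recombination into z_d and z_r (each of width 2.5C).
import torch
import torch.nn as nn

def channel_shuffle(v: torch.Tensor, groups: int) -> torch.Tensor:
    # v: (B, C); requires C divisible by `groups`.
    b, c = v.shape
    return v.view(b, groups, c // groups).transpose(1, 2).reshape(b, c)

class SemanticSplitRecombine(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        # Four parameter-independent FC layers resample the shuffled vectors.
        self.det_shallow = nn.Linear(c // 2, c // 2)
        self.det_deep = nn.Linear(4 * c, 2 * c)
        self.reid_shallow = nn.Linear(c // 2, c // 2)
        self.reid_deep = nn.Linear(4 * c, 2 * c)

    def forward(self, z_l, z_h):
        z_l, z_h = channel_shuffle(z_l, 4), channel_shuffle(z_h, 4)
        z_d = torch.cat([self.det_shallow(z_l), self.det_deep(z_h)], dim=1)    # (B, 2.5C)
        z_r = torch.cat([self.reid_shallow(z_l), self.reid_deep(z_h)], dim=1)  # (B, 2.5C)
        return z_d, z_r
```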
S105: the multi-dimensional semantic vector adapted to the detection branch and the multi-dimensional semantic vector adapted to the re-identification branch are processed through two sets of parameter-independent fully-connected layers, batch normalization and Sigmoid activation functions, respectively, to complete the alignment of the multi-dimensional semantic vectors;
to improve the consistency of the semantic feature distributions, the alignment of the semantic vectors is completed through two sets of parameter-independent fully-connected (FC) layers followed by batch normalization (BN) and Sigmoid activation operations.
S106: the two aligned multi-dimensional semantic vectors are respectively weighted and summed with the shared features to obtain the multi-dimensionally semantically aligned detection features and re-identification features;
an out-of-order rearrangement is applied once more to improve the generalization ability of the semantic vectors; the two semantic vectors are then used to weight the original input features, a residual connection is added after the weighting, and coarse-grained feature fusion is performed, yielding the detection-branch features and the re-identification features. The calculation can be expressed as F_d = F_b + z_d ⊙ F_b and F_r = F_b + z_r ⊙ F_b, where ⊙ denotes channel-wise weighting of the shared feature by the aligned semantic vector.
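The alignment and fusion of S105-S106 could look like the sketch below; mapping the 2.5C-dimensional vectors back to C channels inside the FC layer is our assumption, made so that the Sigmoid weights match the channel count of F_b.

```python
# Assumed sketch of S105-S106: FC + BN + Sigmoid alignment, then
# channel-wise weighting of the shared feature with a residual connection.
import torch
import torch.nn as nn

class SemanticAlignFuse(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        self.align_det = nn.Sequential(nn.Linear(int(2.5 * c), c), nn.BatchNorm1d(c), nn.Sigmoid())
        self.align_reid = nn.Sequential(nn.Linear(int(2.5 * c), c), nn.BatchNorm1d(c), nn.Sigmoid())

    def forward(self, f_b, z_d, z_r):
        # f_b: (B, C, H, W); z_d, z_r: (B, 2.5C)
        w_d = self.align_det(z_d).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        w_r = self.align_reid(z_r).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        f_det = f_b + w_d * f_b    # detection features, coarse-grained fusion
        f_reid = f_b + w_r * f_b   # re-identification features
        return f_det, f_reid
```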
Figs. 3, 4, 5 and 6 show the visualization of one frame after multi-dimensional semantic alignment: in order, the original frame, the shared feature map extracted by the backbone, the aligned detection feature map, and the aligned re-identification feature map. Compared with the shared feature map, the detection and re-identification alignment feature maps are more focused, and their focus regions differ, reflecting that different branches have different requirements on the features.
S107: a current-frame detection box and current-frame appearance embedding vectors are calculated from the detection features and the re-identification features;
the heatmap tensor O_heatmap, the offset branch tensor O_offset and the size branch tensor O_size are calculated from the detection features, from which the current-frame detection boxes are obtained; O_heatmap locates the center points of objects, O_size estimates the size of each detection box, and O_offset compensates for the center-point offset introduced when the shared features are downsampled;
the appearance embedding tensor is calculated from the re-identification features, from which the 128-dimensional current-frame appearance embedding vectors are extracted;
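The decoding these heads imply follows the CenterNet/FairMOT style; the sketch below is an assumed implementation, with top-k peak selection standing in for the peak extraction that the text leaves unspecified.

```python
# Assumed CenterNet-style decoding for S107 (top-k selection is illustrative).
import torch

def decode_detections(heatmap, offset, size, embed, k=100):
    # heatmap: (1, H, W); offset, size: (2, H, W); embed: (128, H, W)
    w_map = heatmap.shape[-1]
    scores, inds = heatmap.flatten().topk(k)
    ys = torch.div(inds, w_map, rounding_mode="floor")
    xs = inds % w_map
    cx = xs.float() + offset[0, ys, xs]        # offset-corrected center x
    cy = ys.float() + offset[1, ys, xs]        # offset-corrected center y
    w, h = size[0, ys, xs], size[1, ys, xs]    # box width / height
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
    embeds = embed[:, ys, xs].t()              # (k, 128) appearance embeddings
    return boxes, scores, embeds
```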
S108: judging whether the current video frame image is the first frame, and if not, matching and associating the current-frame detection boxes and the current-frame appearance embedding vectors with historical frame trajectories;
if the current frame is the first frame, the target trajectories are initialized;
a re-identification embedding affinity matrix between all targets of the current frame and the historical frame targets is calculated, and a motion-model constraint is added for trajectory association by combining Kalman filtering;
the optimal matching is solved with the Hungarian algorithm, and the target trajectory states of the current frame are updated;
the unmatched targets are re-matched using the IoU distance, the target trajectory states of the current frame are updated, and stable trajectories are finally obtained.
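The association step can be sketched as below. The cosine-distance form of the affinity, the gating threshold, and the helper signature are our assumptions; the patent only names the ingredients (affinity matrix, Kalman constraint, Hungarian matching, IoU re-matching).

```python
# Assumed association sketch for S108: cosine appearance affinity gated by a
# Kalman motion cost, then Hungarian assignment. The IoU re-matching of the
# leftovers (not shown) repeats the same assignment over an IoU cost matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_embeds, det_embeds, motion_cost, gate=0.7):
    # track_embeds: (T, 128); det_embeds: (D, 128); motion_cost: (T, D)
    a = track_embeds / np.linalg.norm(track_embeds, axis=1, keepdims=True)
    b = det_embeds / np.linalg.norm(det_embeds, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                            # re-ID embedding affinity
    cost = np.where(motion_cost > gate, 1e5, cost)  # Kalman-filter motion gate
    rows, cols = linear_sum_assignment(cost)        # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 1e5]
```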
S109: and continuing to process the next video frame image until the video is finished.
The present invention first performs global average pooling on the shared feature F_b in the channel dimension C to obtain the shared semantic vector z_c, and transforms it by dimension into a low-dimensional vector z_l representing shallow semantic information and a high-dimensional vector z_h representing deep semantic information. The shared semantic vector is then decoupled through operations such as out-of-order rearrangement and the splitting and recombination of the high- and low-dimensional vectors, yielding semantic vectors aligned across the low and high dimensions and adapted to the detection branch and the re-identification branch. Finally, weighting and summation with the shared features produce the multi-dimensionally semantically aligned detection features and re-identification features. Through the proposed multi-dimensional semantic alignment module, the two sub-branch tasks align the semantic features of the shallow and deep dimensions and select them independently, effectively relieving competition for the focus positions of feature attention during joint optimization.
Based on the above embodiments, the present embodiment further describes step S106 in detail:
as shown in fig. 7, after the multi-dimensionally semantically aligned detection features and re-identification features are obtained:
the cross-region embedding alignment module is mainly based on locally and globally perceived context information, filtering noise and achieving cross-region embedding alignment. Considering the high resolution of the input features, to reduce the computational burden the re-identification features are axially pooled along the spatial dimensions H and W to obtain two axial features v_1 ∈ R^{C×W} and v_2 ∈ R^{C×H};
the two axial features are aggregated along the channel dimension C to obtain the aggregated feature v_hw ∈ R^{C×(H+W)}, which keeps the computational complexity within O((H+W)C²); at the same time, this operation captures long-range dependencies along one spatial direction while maintaining accurate positional information along the other;
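A sketch of this pooling and aggregation (average pooling is assumed, since the text does not specify the pooling type):

```python
# Assumed sketch of the axial pooling and aggregation step.
import torch

def axial_aggregate(f_reid: torch.Tensor) -> torch.Tensor:
    # f_reid: (B, C, H, W) re-identification features
    v1 = f_reid.mean(dim=2)            # pool over H -> (B, C, W)
    v2 = f_reid.mean(dim=3)            # pool over W -> (B, C, H)
    return torch.cat([v1, v2], dim=2)  # v_hw: (B, C, H + W)
```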
the aggregated feature is input into the spatial alignment module to complete the consistent alignment of local and global information, obtaining the aggregated alignment feature:
the aggregated feature v_hw is shuffled group-wise along the aggregation channel (the shape of the aggregated feature changes from the original C×(H+W) according to the number of groups g), and spatial axial feature alignment is completed through two fully-connected and activation-function operations;
the spatially axially aligned aggregated feature is shifted along the spatial axial dimension (with shift step s and g channels shifted at a time; the feature shape and size are unchanged), and cross-region axial feature alignment is completed through two fully-connected and activation-function operations;
the aggregated alignment feature is obtained by shift restoration and group restoration of the cross-region axially aligned aggregated feature;
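One way the shuffle-align-shift-align-restore sequence could be realized is sketched below; the FC widths, the ReLU activations, and a whole-tensor roll standing in for the per-group channel shift are all assumptions.

```python
# Assumed sketch of the spatial alignment module: group shuffle, FC+activation,
# axial shift, FC+activation, then the inverse shift and inverse shuffle.
import torch
import torch.nn as nn

class SpatialAlign(nn.Module):
    def __init__(self, length: int, groups: int = 4, shift: int = 1):
        super().__init__()
        self.g, self.s = groups, shift
        self.fc1 = nn.Sequential(nn.Linear(length, length), nn.ReLU())
        self.fc2 = nn.Sequential(nn.Linear(length, length), nn.ReLU())

    def forward(self, v):                   # v: (B, C, L) with L = H + W
        b, c, l = v.shape
        v = v.view(b, self.g, c // self.g, l).transpose(1, 2).reshape(b, c, l)  # group shuffle
        v = self.fc1(v)                     # spatial axial feature alignment
        v = torch.roll(v, self.s, dims=2)   # shift along the spatial axis
        v = self.fc2(v)                     # cross-region axial feature alignment
        v = torch.roll(v, -self.s, dims=2)  # shift restoration
        v = v.view(b, c // self.g, self.g, l).transpose(1, 2).reshape(b, c, l)  # group restoration
        return v
```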
the aggregated alignment feature is split into two axial alignment features along the channel dimension C;
the two axial alignment features undergo linear transformations and activation-function operations to produce two weight maps, which successively weight the re-identification features; a residual connection with the re-identification features then yields the cross-region aligned re-identification features.
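This final fusion admits a sketch like the following; proj_w and proj_h are assumed nn.Linear layers over the axial lengths, and the Sigmoid gating is our reading of "linear transformation and activation function operation".

```python
# Assumed sketch of the cross-region weighted fusion with residual connection.
import torch

def cross_region_fuse(f_reid, v_aligned, w, proj_w, proj_h):
    # f_reid: (B, C, H, W); v_aligned: (B, C, H + W)
    a_w, a_h = v_aligned[..., :w], v_aligned[..., w:]  # split axial parts
    g_w = torch.sigmoid(proj_w(a_w)).unsqueeze(2)      # (B, C, 1, W) gate
    g_h = torch.sigmoid(proj_h(a_h)).unsqueeze(3)      # (B, C, H, 1) gate
    return f_reid + g_h * g_w * f_reid                 # residual weighted fusion
```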
For the spatial dimensions H and W of the shared features, the re-identification features are first axially pooled along the two directions to obtain the two axial features v_h ∈ R^{C×W} and v_w ∈ R^{C×H}. The two axial features are then combined, and cross-region information interaction is completed through a series of rearrangement operations, thereby realizing feature alignment in the spatial dimension and obtaining the aligned embedding. The cross-region embedding alignment module constructed by the invention effectively captures local and global context information in the spatial dimension. It aligns the spatial features while balancing local and global perception, obtaining a more effective consistent representation, which improves the discriminability of the pedestrian re-identification features and thus the tracking accuracy.
Figs. 8 and 9 show the visualization of one frame before and after cross-region embedding alignment; comparing the two feature maps, the focus and perception range are clearly improved after cross-region alignment, indicating that the module effectively filters noise, captures more useful information, and balances local and global perception.
Based on the above embodiments, this embodiment further describes step S107, specifically as follows:
after the current-frame detection box and the current-frame appearance embedding vector are obtained:
as shown in fig. 10, fig. 10 is a structural diagram of the manifold space projection alignment module, which mainly performs feature alignment of the association information of the detection and re-identification subtasks. Since the feature expressions of the two branches are inconsistent in the explicit space, the alignment of their association information cannot be achieved by direct multiplication or addition; instead, the two features must be projected into a common manifold space through nonlinear transformations to complete the alignment of the associated features.
the heatmap tensor, the offset branch tensor and the size branch tensor are combined to obtain the combined feature;
a linear transformation and an activation function are applied to the current-frame appearance embedding vector to obtain the first projection vector E_k ∈ R^{HW×128} and the second projection vector E_v ∈ R^{128×HW} of the manifold space;
the combined feature is multiplied by the first projection vector E_k (yielding a 7×128-dimensional projection that represents the detection features and the re-identification features in the manifold space), and a linear transformation and an activation function are applied to obtain the combined feature in which the detection features and the re-identification features are aligned;
this aligned combined feature is multiplied by the second projection vector E_v to obtain the detection vector aligned with the association information;
the detection vector aligned with the association information is disassembled into a heatmap tensor, an offset branch tensor and a size branch tensor aligned with the association information, from which the manifold-space-projection-aligned current-frame detection box is obtained.
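A sketch of the two projection multiplications follows. Treating E_k and E_v as learned transforms of the appearance embedding follows the text; the layer widths and ReLU choice are assumptions, and K denotes the channel count of the combined heads.

```python
# Assumed sketch of the manifold space projection alignment.
import torch
import torch.nn as nn

class ManifoldAlign(nn.Module):
    def __init__(self, d: int = 128):
        super().__init__()
        # E_k / E_v: linear transform + activation of the appearance embedding.
        self.to_k = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.to_v = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.refine = nn.Sequential(nn.Linear(d, d), nn.ReLU())

    def forward(self, combined, embed):
        # combined: (B, K, HW) flattened heads; embed: (B, HW, D) embeddings.
        e_k = self.to_k(embed)                     # first projection,  (B, HW, D)
        e_v = self.to_v(embed).transpose(1, 2)     # second projection, (B, D, HW)
        m = self.refine(torch.bmm(combined, e_k))  # manifold-space view, (B, K, D)
        return torch.bmm(m, e_v)                   # aligned detection heads, (B, K, HW)
```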
Figs. 11 and 12 show the visualization before and after the manifold space projection alignment module. Comparing the feature maps before and after projection, the focus regions of the detected targets are clearly more concentrated; this benefits from the alignment of the associated features of the two branches, through which the detection branch obtains complementary information from the re-identification branch and localizes objects more accurately.
For the associated feature representations in the detection branch and the re-identification branch, two manifold-space projection transformations and a weighted fusion implicitly align the re-identification branch output with the three outputs of the detection branch, resulting in implicitly spatially aligned detection outputs. In the present invention, the targets of interest of the detection and re-identification branches are the same, so the two subtasks are correlated and a certain mapping relationship exists in the manifold space; the manifold space projection alignment module is therefore introduced, which effectively coordinates the joint optimization of the two subtasks and avoids the competing conflicts caused by unaligned feature distributions.
The joint optimization of shared features faces several problems: detection distinguishes pedestrians as a class and focuses on homogeneous class-level features while neglecting individual variability, whereas re-identification must focus on unique individualized features to distinguish each person. The features required by the detection branch and the re-identification branch are therefore heterogeneous, which leads to misaligned feature distributions in the kernel space and the implicit space; how to obtain input features that fit a particular branch task while remaining consistent is a topic that joint optimization cannot bypass. Based on these considerations, the invention provides a tracking framework based on kernel-space and implicit-space feature alignment (as shown in fig. 13) to relieve the optimization conflicts caused by feature misalignment. First, the alignment of the subtask kernel space is realized through the multi-dimensional semantic alignment module and the cross-region embedding alignment module. Through these operations, the detection branch and the re-identification branch each obtain independent features, achieving consistent alignment of the subtask features across semantic dimensions and embedding dimensions. Second, for related branch tasks, their association information should find a corresponding mapping in a common manifold space and promote the successful optimization of each subsequent task. The invention therefore realizes the feature alignment of the association information of the different subtasks through the manifold space projection alignment module, satisfying the coordination consistency of multi-task joint optimization. The main aim of the invention is to provide a multi-feature-alignment tracking framework, MFATracker, based on kernel-space and implicit-space feature alignment, which better completes the joint optimization of multiple tasks in complex scenes such as crowded pedestrians and achieves the optimal solution of each branch subtask, thereby improving the accuracy and robustness of multi-target tracking.
Based on the above embodiments, to verify the accuracy and robustness of the invention, multiple experiments were performed on the public MOT17 and MOT20 datasets, as follows:
Fig. 14 shows the tracking results of the tracker on the MOT20 dataset, displaying frames 45, 181 and 281 from left to right, where the same pedestrian in different frames is assigned the same identity number.
The MOT17 dataset comprises 14 video sequences and 1342 trajectories, with interference factors such as different camera angles, different weather conditions and different camera motions, and a balanced crowd-density distribution. MOT20 is a newer dataset containing 8 video sequences and about 13400 frames in total; with an average crowd density of 246 pedestrians per frame it is a crowded scene and therefore more challenging.
The experiments are divided into two parts: offline ablation verification, completed mainly on the MOT17 training set, and online full-set verification, completed on the full MOT17 and MOT20 sets.
(I) Offline ablation verification:
Experimental parameter settings: the basic settings are consistent with the baseline network FairMOT; after pre-training on the CrowdHuman dataset, the first half of each MOT17 training sequence is used as the training set and the second half as the validation set. In the training stage, input images are uniformly resized to 1088×608 and trained iteratively with the Adam optimizer; the learning rate is kept at 0.0001 for the first 20 epochs and reduced to 0.00001 for the last 10 epochs.
The method provided by the invention mainly comprises three parts: 1) the multi-dimensional semantic alignment module (MSA); 2) the cross-region embedding alignment module (CEA); 3) the manifold space projection alignment module (MSPA). From the results in Table 1 it is evident that tracking accuracy and robustness improve steadily as the corresponding modules are added. In Table 1, Baseline denotes the base network model, MOTA is the tracking accuracy metric, IDF1 is the target identity correctness metric, and IDS is the number of identity switches of the same target.
Table 1. Experimental results on the MOT17 validation set (MOTA↑, IDF1↑, IDS↓)

Network configuration | MOTA↑ | IDF1↑ | IDS↓
Baseline              | 71.1  | 73.2  | 437
Baseline+MSA          | 71.9  | 74.3  | 420
Baseline+MSA+CEA      | 72.1  | 74.4  | 402
Baseline+MSA+MSPA     | 72.0  | 74.3  | 416
Baseline+ALL          | 72.3  | 74.4  | 407
(II) Online full-set verification:
The online datasets use the MOTChallenge public benchmark (https://motchallenge.net). Experimental parameter settings: when training on the MOT17 full set, the parameters are consistent with the offline ablation experiments, except that the first 7 video sequences are used as the training set and the last 7 as the test set. When training on the MOT20 full set, the parameters are kept consistent with the baseline network; the training set uses the first 4 video sequences of MOT20 and the test set uses the last 4. Training is fine-tuned for the first 15 epochs with the learning rate kept at 0.0001, reduced to 0.00001 for the last 5 epochs. The test results are shown in Table 2:
Table 2. Multi-target tracking accuracy (MOTA) on MOT17 and MOT20

Method            | MOT17 (%) | MOT20 (%)
Baseline network  | 73.7      | 61.8
Present invention | 74.2      | 66.4
Referring to fig. 15, fig. 15 is a block diagram of an apparatus for the multi-target tracking method according to an embodiment of the present invention; the specific apparatus may include:
the shared feature calculation module 100, configured to acquire a current video frame image and calculate shared features of the current video frame image;
the shared semantic feature vector calculation module 200, configured to perform global average pooling on the shared features in the channel dimension to obtain a shared semantic feature vector;
the dimension transformation module 300, configured to transform the shared semantic feature vector into a low-dimensional vector representing shallow semantic information and a high-dimensional vector representing deep semantic information;
the multi-dimensional semantic vector calculation module 400, configured to split and recombine the channel-shuffled low-dimensional vector and high-dimensional vector to obtain multi-dimensional semantic vectors adapted to the detection branch and to the re-identification branch while carrying both shallow and deep semantic information;
the multi-dimensional semantic vector alignment module 500, configured to process the multi-dimensional semantic vector adapted to the detection branch and the multi-dimensional semantic vector adapted to the re-identification branch through two sets of parameter-independent fully-connected layers, batch normalization and Sigmoid activation functions, respectively, to complete the alignment of the multi-dimensional semantic vectors;
the detection feature and re-identification feature calculation module 600, configured to respectively weight and sum the two aligned multi-dimensional semantic vectors with the shared features to obtain multi-dimensionally semantically aligned detection features and re-identification features;
the detection box and appearance embedding vector calculation module 700, configured to calculate a current-frame detection box and a current-frame appearance embedding vector from the detection features and the re-identification features;
the matching association module 800, configured to judge whether the current video frame image is the first frame, and if not, to match and associate the current-frame detection box and the current-frame appearance embedding vector with historical frame trajectories;
the loop processing module 900, configured to continue processing the next video frame image until the video ends.
The multi-target tracking apparatus of this embodiment is used to implement the multi-target tracking method described above, so its specific implementation follows the embodiment parts of the method. For example, the shared feature calculation module 100, the shared semantic feature vector calculation module 200, the dimension transformation module 300, the multi-dimensional semantic vector calculation module 400, the multi-dimensional semantic vector alignment module 500, the detection feature and re-identification feature calculation module 600, the detection box and appearance embedding vector calculation module 700, the matching association module 800 and the loop processing module 900 are used to implement steps S101 through S109 of the multi-target tracking method, respectively, so their specific implementations can refer to the descriptions of the corresponding embodiments and are not repeated here.
The embodiment of the invention also provides a multi-target tracking device, which comprises: a memory for storing a computer program; and the processor is used for realizing the steps of the multi-target tracking method when executing the computer program.
The specific embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the steps of the multi-target tracking method when being executed by a processor.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is apparent that the above examples are given by way of illustration only and are not limiting of the embodiments. Other variations and modifications will be apparent to those of ordinary skill in the art in light of the foregoing description; it is neither necessary nor possible to exhaust all embodiments here. Obvious variations or modifications derived therefrom by those skilled in the art remain within the scope of the invention.

Claims (9)

1. A multi-target tracking method, comprising:
acquiring a current video frame image and calculating sharing characteristics of the current video frame image;
carrying out global average pooling on the shared features in the channel dimension to obtain a shared semantic feature vector;
the shared semantic feature vector is transformed by dimension to obtain a low-dimensional vector representing shallow semantic information and a high-dimensional vector representing deep semantic information;
splitting and recombining the channel-shuffled low-dimensional vector and high-dimensional vector to obtain multi-dimensional semantic vectors that are adapted to the detection branch and to the re-identification branch, respectively, while carrying both shallow and deep semantic information;
processing the multi-dimensional semantic vector adapted to the detection branch and the multi-dimensional semantic vector adapted to the re-identification branch through two sets of parameter-independent fully-connected layers, batch normalization and Sigmoid activation functions, respectively, to complete the alignment of the multi-dimensional semantic vectors;
respectively weighting and summing the two aligned multi-dimensional semantic vectors with the shared features to obtain multi-dimensionally semantically aligned detection features and re-identification features;
calculating a heatmap tensor, an offset branch tensor and a size branch tensor from the detection features, and from these obtaining a current-frame detection box; calculating an appearance embedding tensor from the re-identification features, and extracting a 128-dimensional current-frame appearance embedding vector;
judging whether the current video frame image is the first frame, and if not, matching and associating the current-frame detection box and the current-frame appearance embedding vector with historical frame trajectories;
and continuing to process the next video frame image until the video ends.
2. The multi-target tracking method according to claim 1, wherein the splitting and recombining of the channel-shuffled low-dimensional vector and high-dimensional vector to obtain multi-dimensional semantic vectors adapted to the detection branch and to the re-identification branch while carrying both shallow and deep semantic information comprises:
resampling the channel-shuffled low-dimensional vector and high-dimensional vector through four fully-connected operations with independent parameters, to obtain a shallow semantic vector and a deep semantic vector adapted to the detection branch and a shallow semantic vector and a deep semantic vector adapted to the re-identification branch;
and combining the shallow and deep semantic vectors adapted to the detection branch and those adapted to the re-identification branch, respectively, to obtain the multi-dimensional semantic vectors adapted to the detection branch and to the re-identification branch that carry both shallow and deep semantic information.
3. The multi-target tracking method according to claim 1, wherein after the two aligned multi-dimensional semantic vectors are respectively weighted and summed with the shared features to obtain the multi-dimensionally semantically aligned detection features and re-identification features, the method further comprises:
axially pooling the re-identification features along the two spatial dimensions H and W to obtain two axial features;
aggregating the two axial features along the channel dimension C to obtain an aggregated feature;
inputting the aggregated feature into a spatial alignment module to complete the consistent alignment of local and global information, obtaining an aggregated alignment feature;
splitting the aggregated alignment feature into two axial alignment features along the channel dimension;
and applying linear transformations and activation functions to the two axial alignment features, using them to weight the re-identification features, and adding a residual connection with the re-identification features for coarse-grained feature fusion after the weighting, to obtain cross-region aligned re-identification features.
4. The multi-target tracking method of claim 3, wherein inputting the aggregated features into the spatial alignment module to complete consistent alignment of local and global information and obtain the aggregated alignment features comprises:
group-shuffling the aggregated features along the aggregation channel, and completing spatial axial feature alignment through two fully connected and activation function operations;
shifting the spatially axially aligned aggregated features along the spatial axial dimension, with shift step s and g channels shifted at a time, leaving the feature shape and size unchanged, and completing cross-region axial feature alignment through two fully connected and activation function operations;
and obtaining the aggregated alignment features through shift recovery and group recovery of the cross-region axially aligned aggregated features.
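The shift and its recovery can be illustrated as follows; `torch.roll` keeps the tensor shape unchanged, matching the claim, but the progressive per-group offset and the default values of `step` and `group` are our assumptions.

```python
import torch

def axial_shift(feat, step=1, group=8):
    """Roll successive `group`-channel slices of a (B, C, L) axial feature
    along the spatial axis by growing multiples of `step`; shape unchanged."""
    shifted = feat.clone()
    for i in range(0, feat.size(1), group):
        offset = (i // group) * step          # each channel group shifts further
        shifted[:, i:i + group] = torch.roll(feat[:, i:i + group],
                                             shifts=offset, dims=2)
    return shifted

def axial_shift_restore(feat, step=1, group=8):
    """Inverse shift, i.e. the 'shift recovery' of claim 4."""
    restored = feat.clone()
    for i in range(0, feat.size(1), group):
        offset = (i // group) * step
        restored[:, i:i + group] = torch.roll(feat[:, i:i + group],
                                              shifts=-offset, dims=2)
    return restored
```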
5. The multi-target tracking method according to claim 1, wherein calculating the current-frame detection boxes and the current-frame appearance embedding vectors from the detection features and the re-identification features comprises:
combining the heatmap tensor, the offset branch tensor and the size branch tensor to obtain a combined feature;
applying a linear transformation and an activation function to the current-frame appearance embedding vector to obtain a first projection vector and a second projection vector in a manifold space;
multiplying the combined feature by the first projection vector, then applying a linear transformation and an activation function, to obtain a combined feature in which the detection features and the re-identification features are aligned;
multiplying that aligned combined feature by the second projection vector to obtain a detection vector aligned with the association information;
and splitting the association-aligned detection vector to recover the association-aligned heatmap tensor, offset branch tensor and size branch tensor, thereby obtaining the manifold-projection-aligned current-frame detection boxes.
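A hedged sketch of the manifold-space projection alignment on flattened vectors; the two Sigmoid-gated projections and the intermediate mixing layer are our assumptions about one plausible realization, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ManifoldAlign(nn.Module):
    """Two projections of the appearance embedding gate the combined
    detection tensor so that detection and re-identification information
    agree in a shared manifold space."""
    def __init__(self, det_dim=512, embed_dim=128):
        super().__init__()
        self.proj1 = nn.Sequential(nn.Linear(embed_dim, det_dim), nn.Sigmoid())
        self.proj2 = nn.Sequential(nn.Linear(embed_dim, det_dim), nn.Sigmoid())
        self.mix = nn.Sequential(nn.Linear(det_dim, det_dim), nn.ReLU())

    def forward(self, combined, appearance):
        # combined: (B, det_dim) merge of heatmap/offset/size tensors
        # appearance: (B, embed_dim) current-frame appearance embedding
        aligned = self.mix(combined * self.proj1(appearance))  # first projection
        # the caller splits the result back into heatmap/offset/size tensors
        return aligned * self.proj2(appearance)                # second projection
```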
6. The multi-target tracking method of claim 1, wherein matching and associating the current-frame detection boxes and the current-frame appearance embedding vectors with historical-frame tracks comprises:
calculating a re-identification embedding affinity matrix between all current-frame targets and the historical-frame targets, and adding a motion-model constraint for track association by incorporating Kalman filtering;
solving the optimal matching with the Hungarian algorithm, and updating the current-frame target track states;
and re-matching the unmatched targets using the IoU distance, and updating the current-frame target track states.
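The embedding-based association can be illustrated with a short NumPy/SciPy sketch; the cosine-distance affinity, the boolean gating placeholder for the Kalman motion constraint, and the 0.7 threshold are assumptions rather than the patented parameters.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_embs, det_embs, motion_gate=None, max_cost=0.7):
    """Hungarian matching on a re-identification embedding affinity matrix.

    `motion_gate` is an optional boolean matrix marking track/detection
    pairs ruled out by a Kalman-filter motion model; unmatched detections
    would proceed to the IoU re-matching stage."""
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    cost = 1.0 - t @ d.T                         # cosine-distance affinity
    if motion_gate is not None:
        cost = np.where(motion_gate, 1e5, cost)  # forbid implausible pairs
    rows, cols = linear_sum_assignment(cost)     # Hungarian optimal matching
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
    matched_dets = {c for _, c in matches}
    unmatched = [c for c in range(len(det_embs)) if c not in matched_dets]
    return matches, unmatched
```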
7. An apparatus for multi-target tracking, comprising:
the shared feature calculation module is used for acquiring a current video frame image and calculating shared features of the current video frame image;
the shared semantic feature vector calculation module is used for globally average-pooling the shared features over the channel dimension to obtain a shared semantic feature vector;
the dimension transformation module is used for transforming, through dimension transformation, the shared semantic feature vector into a low-dimensional vector representing shallow semantic information and a high-dimensional vector representing deep semantic information;
the multi-dimensional semantic vector calculation module is used for splitting and recombining the channel-shuffled low-dimensional vector and high-dimensional vector to obtain multi-dimensional semantic vectors that are respectively adapted to the detection branch and the re-identification branch while carrying both shallow and deep semantic information;
the multi-dimensional semantic vector alignment module is used for processing the multi-dimensional semantic vector adapted to the detection branch and the one adapted to the re-identification branch through two groups of fully connected layers with independent parameters, normalization pooling operations and Sigmoid activation functions, completing the alignment of the multi-dimensional semantic vectors;
the detection feature and re-identification feature calculation module is used for weighting and summing each of the two aligned multi-dimensional semantic vectors with the shared features to obtain multi-dimensionally semantically aligned detection features and re-identification features;
the detection box and appearance embedding vector calculation module is used for calculating a heatmap tensor, an offset branch tensor and a size branch tensor from the detection features to obtain the current-frame detection boxes, and for calculating an appearance embedding tensor from the re-identification features and extracting a 128-dimensional current-frame appearance embedding vector;
the matching association module is used for judging whether the current video frame image is the first frame, and if not, matching and associating the current-frame detection boxes and the current-frame appearance embedding vectors with historical-frame tracks;
and the loop processing module is used for continuing to process the next video frame image until the video ends.
8. An apparatus for multi-target tracking, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the multi-target tracking method according to any one of claims 1 to 6 when executing said computer program.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the multi-target tracking method according to any one of claims 1 to 6.
CN202210689366.9A 2022-06-17 2022-06-17 Multi-target tracking method based on kernel space and implicit space feature alignment Active CN115272404B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210689366.9A CN115272404B (en) 2022-06-17 2022-06-17 Multi-target tracking method based on kernel space and implicit space feature alignment

Publications (2)

Publication Number Publication Date
CN115272404A CN115272404A (en) 2022-11-01
CN115272404B true CN115272404B (en) 2023-07-18

Family

ID=83762257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210689366.9A Active CN115272404B (en) 2022-06-17 2022-06-17 Multi-target tracking method based on kernel space and implicit space feature alignment

Country Status (1)

Country Link
CN (1) CN115272404B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880615B (en) * 2023-02-17 2023-05-09 武汉图科智能科技有限公司 Online multi-target tracking method based on fine-grained appearance representation

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829436B (en) * 2019-02-02 2022-05-13 福州大学 Multi-face tracking method based on depth appearance characteristics and self-adaptive aggregation network
US11017538B2 (en) * 2019-10-24 2021-05-25 Microsoft Technology Licensing, Llc Multiple object tracking
CN113920395A (en) * 2021-09-30 2022-01-11 北京熵简科技有限公司 Lightweight semi-supervised model framework for field of few samples
CN113920538B (en) * 2021-10-20 2023-04-14 北京多维视通技术有限公司 Object detection method, device, equipment, storage medium and computer program product
CN114494349A (en) * 2022-01-27 2022-05-13 上海交通大学 Video tracking system and method based on target feature space-time alignment
CN114419605B (en) * 2022-03-29 2022-07-19 之江实验室 Visual enhancement method and system based on multi-network vehicle-connected space alignment feature fusion

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581396A (en) * 2020-05-06 2020-08-25 西安交通大学 Event graph construction system and method based on multi-dimensional feature fusion and dependency syntax
CN113255895A (en) * 2021-06-07 2021-08-13 之江实验室 Graph neural network representation learning-based structure graph alignment method and multi-graph joint data mining method

Also Published As

Publication number Publication date
CN115272404A (en) 2022-11-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant