CN113379793B - On-line multi-target tracking method based on twin network structure and attention mechanism - Google Patents

On-line multi-target tracking method based on twin network structure and attention mechanism

Info

Publication number
CN113379793B
Authority
CN
China
Prior art keywords
target
track
tracking
box
obj
Prior art date
Legal status
Active
Application number
CN202110545060.1A
Other languages
Chinese (zh)
Other versions
CN113379793A (en)
Inventor
陈光柱
李春江
Current Assignee
Chengdu University of Technology
Original Assignee
Chengdu University of Technology
Priority date
Filing date
Publication date
Application filed by Chengdu University of Technology
Priority to CN202110545060.1A
Publication of CN113379793A
Application granted
Publication of CN113379793B
Legal status: Active


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 — Image analysis
    • G06T7/20 — Analysis of motion
    • G06T7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G06T7/70 — Determining position or orientation of objects or cameras
    • G06T7/73 — Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/00 — Indexing scheme for image analysis or image enhancement
    • G06T2207/20 — Special algorithmic details
    • G06T2207/20081 — Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an online multi-target tracking method based on a twin network structure and an attention mechanism. Online multi-target tracking is the process of extending multiple target tracks using only past information, and tracking drift, missed detections, and occlusion between objects are common problems in this process. Against tracking drift, the invention provides a novel multi-target tracking algorithm comprising a fusion framework of YOLOv3 and KCF based on a twin network structure that detects and tracks targets simultaneously, and designs a Linear Regression Model (LRM) that combines the detection result and the tracking result to suppress the drift. To overcome missed detections and occlusion, the invention further provides a structural similarity calculation method (SSC) based on an attention mechanism, which first learns and extracts the distinguishing features of each object using the attention mechanism and then obtains the relevance between targets and tracks through the structural similarity calculation. Finally, on this basis, a multi-target tracking strategy is designed that stably realizes the initialization, extension, and termination (inactivation) of each track.

Description

On-line multi-target tracking method based on twin network structure and attention mechanism
Technical Field
The present invention relates to the field of computer vision, and more particularly to multi-target tracking.
Background
Multi-object tracking (MOT) technology is widely used in practical applications such as automatic driving, human-computer interaction, video surveillance, and virtual reality. Conceptually, MOT determines the positions of multiple targets across consecutive images and derives the track of each visible target; its key operations are the initialization, extension, and termination of tracks. MOT mainly comprises two stages: target detection on each image, and computation of target associations between consecutive frames to generate the corresponding tracks. Two key problems arise in this process: missed and inaccurate detections, and associations made difficult by occlusion between targets. The invention therefore provides a fusion framework of the YOLOv3 and KCF algorithms based on a twin structure to suppress missed and inaccurate detections, and designs an image structural similarity framework based on an attention mechanism to strengthen the computation of target-track relevance. On this basis, a multi-target tracking strategy is designed that achieves long-term tracking of targets in real-world scenes.
Disclosure of Invention
Aiming at these challenges, the invention provides a long-term online multi-target tracking method based on an integrated framework and attention (IFA-MOT), which effectively ensures long-term MOT. The structural similarity calculation addresses the irregular motion of large numbers of similar objects: first, the features of the current frame image and of the tracked objects in past frames are extracted through the same network; the detection and tracking results are then obtained with the detection model and the tracking model, respectively; finally, structural similarity calculations based on the attention mechanism and location information associate the detection and tracking results.
In a first aspect, the present invention provides a detector-tracker integration framework (DTI) that can simultaneously perform online detection and tracking of objects and provides richer location information for MOT. A Linear Regression Model (LRM) is designed to combine the tracking result with the detection result, so that tracking drift is suppressed.
In a second aspect, the present invention provides a structural similarity calculation method based on an attention mechanism to capture the uniqueness of each object and strengthen the association between objects and tracks. The method better handles track interruption, posture change, and occlusion caused by irregular motion.
In a third aspect, the invention designs a multi-stage MOT strategy (MOTS) based on the above two aspects to form a complete online MOT system.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 shows an optimization framework diagram of the KCF algorithm of the present invention.
FIG. 2 shows a fusion framework diagram of the YOLOv3 algorithm and the optimized KCF algorithm according to an embodiment of the invention.
FIG. 3 shows the feature map extracted by f_x in the YOLOv3 and KCF algorithms according to the present invention.
Fig. 4 shows a structural similarity calculation diagram based on an attention mechanism (SSC) according to the present invention.
FIG. 5 illustrates a multi-target tracking strategy (MOTS) diagram according to the present invention.
In order that those skilled in the art will better understand the concepts of the present invention, reference will now be made to the examples of the present invention.
Detailed Description
Referring to FIG. 1, FIG. 1 shows an example diagram of the optimized KCF of the present invention. The improved KCF algorithm takes the previous frame image and the ROI corresponding to the target as input to complete the computation of the target's filter. The current frame image is then input, and the target's position in the current frame is obtained using the computed filter.
After the improvement of the KCF algorithm is completed, the invention utilizes a twin network structure to fuse YOLOv3 and the KCF algorithm:
referring to fig. 2, fig. 2 shows a fused example of YOLOv3 and KCF provided by the present invention.
Here, OBJ is the target to be tracked (also the target tracked in the previous frame); IM is the current frame image; f_x is the DBL module in YOLOv3; f_track is the trimmed KCF algorithm (the pre-tracker); f_detection is the remaining network structure of YOLOv3 after the first DBL is removed; F_obj and F_frame are the feature maps of OBJ and IM extracted by f_x; Box_t is the bounding box obtained by tracking; Box_d is the bounding box produced by the detection module; LRM (Linear Regression Model) is the linear regression model shown in the following formula:

Box_new = α × Box_d + β × Box_t

where α and β are balance parameters whose sum equals 1, with α set to 0.8 by experiment; Box_t, Box_d, and Box_new are each denoted [top, bottom, left, right], where [top, left] and [bottom, right] respectively represent the coordinates of the bounding box's upper-left and lower-right corners in the image.
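As a concrete illustration, the following Python sketch applies the LRM fusion to a pair of boxes in the [top, bottom, left, right] layout used above; the function name and the use of NumPy are illustrative assumptions, while α = 0.8 and β = 1 − α follow the text.

```python
# Minimal sketch of the LRM bounding-box fusion: Box_new = a*Box_d + b*Box_t.
import numpy as np

def lrm_fuse(box_d, box_t, alpha=0.8):
    """Fuse a detected box and a pre-tracked box with balance parameters."""
    beta = 1.0 - alpha  # the balance parameters sum to 1
    return alpha * np.asarray(box_d, float) + beta * np.asarray(box_t, float)

# Detection and pre-tracking disagree slightly on the same target:
box_d = [10, 110, 20, 60]   # [top, bottom, left, right]
box_t = [12, 114, 22, 64]
print(lrm_fuse(box_d, box_t))  # -> [ 10.4 110.8  20.4  60.8]
```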
The calculation flow of the fusion framework is as follows:
first, using the above frame image and the corresponding ROI area (OBJ in FIG. 2) as input, calculating the filter through the optimized KCF frame (shown in FIG. 1), and completing f track And (4) updating.
Subsequently, taking the current frame (IM) as input, a feature map is first obtained through the DBL module; this feature map has 64 channels, and an example of its effect is shown in FIG. 3.
Finally, F_frame is fed both into f_detection and into the f_track module with the computed filter, yielding the detection bounding box Box_d and the pre-tracking bounding box Box_t. The final bounding box is then obtained by LRM regression.
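The data flow of one fusion step can be summarized in the following Python sketch; f_x, f_track, f_detection, and lrm_fuse are stand-ins for the modules named above (their internals are not reproduced here), and only the order of calls is taken from the text.

```python
# Schematic of one detector-tracker integration step (cf. FIG. 2).
def dti_step(obj_roi, frame, f_x, f_track, f_detection, lrm_fuse):
    F_obj = f_x(obj_roi)              # features of the target tracked last frame
    F_frame = f_x(frame)              # shared DBL features of the current frame
    box_t = f_track(F_obj, F_frame)   # pre-tracking bounding box Box_t
    box_d = f_detection(F_frame)      # detection bounding box Box_d
    return lrm_fuse(box_d, box_t)     # Box_new via the linear regression model
```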
In the multi-target tracking process, targets often disappear and then reappear, which makes the association between targets and tracks particularly important; the invention therefore designs an image structural similarity algorithm based on an attention mechanism. It is described in detail below:
referring to fig. 4, fig. 4 is a diagram illustrating an image structure similarity algorithm based on attention mechanism according to the present invention. Wherein, Im1 and Im2 are inputs, which mainly representTwo target samples; resize represents the image that changed the image to W × H × 3; f 1 And F 2 Is the image after Resize; AT (attention template) represents an attention template;
Figure RE-RE-GDA0003206252010000031
comprises the following three steps: convolution (AT as the convolution kernel), BatchNormal, and Sigmoid; r, T represent the Reshape and transpose processes of the matrix, respectively; fc represents the full connection layer; AR stands for Attention Result (Attention Result); n represents a total of N training samples; w, H and C represent the length, width and number of channels of the matrix, respectively, where W60, H120 and C64;
Figure RE-RE-GDA0003206252010000032
calculating for the correlation;
Figure RE-RE-GDA0003206252010000033
is the Euclidean distance; o is expressed as the mutual mean square error (CMSE); the remaining parameters are described in the following description.
The specific steps of the structural similarity calculation are as follows:
training of the AT is first required (on-line training) before performing structural similarity calculation of the images. The attention model designed by the invention is an end-to-end network and does not need pre-training. The structure is shown in FIG. 4 as the Attention Model, which takes Im2 (image) as input and gets F after Resize 2 And obtaining a characteristic diagram F through behavior ++ y . Finally change F y After shaping, it is fed to fc to give AR. Here, if Im2 is a positive sample, then the true tag of the AR is set to [0,1 ]]Otherwise, set to [1,0 ]]. In order to train the model more accurately, the method takes N images as Im2, wherein k images are used as historical observation information graphs of the tracking target; the remaining N-k images are viewed with historical information maps of other targets.
When the training of the attention template is completed, Im1 and Im2 are given; note that Im1 and Im2 are the two images to be compared. They are first transformed by Resize to height 120 and width 60, giving F_1 and F_2, which ensures uniform input shapes. F_1 and F_2 are then passed through ⊛_AT, whose convolution kernel is the trained AT, to obtain F_x and F_y.
Subsequently, to better obtain the similarity (SD) between Im1 and Im2, the autocorrelation matrices (SCor) and the cross-correlation matrix (CCor) of the two samples are computed first. Since the height and the width of the image are not equal, the invention evaluates SCor_x, SCor_y, and CCor separately along the two directions:

SCor_x = F_x ⊗ F_x^T,  SCor_y = F_y ⊗ F_y^T

CCor = F_x ⊗ F_y^T

where SCor_x and SCor_y are the autocorrelation matrices of samples x and y, respectively; CCor is the cross-correlation matrix of samples x and y; and F_x^T, F_y^T are respectively the transposes of the feature maps of samples x and y.
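Under the reconstruction above, the three correlation matrices can be computed per channel along each direction, as in this NumPy sketch; treating the "two directions" as batched matrix products along width and height is an assumption.

```python
# Auto- and cross-correlation of two (H, W, C) feature maps, per direction.
import numpy as np

def correlations(Fx, Fy):
    Fx = np.moveaxis(Fx, -1, 0)                # (C, H, W)
    Fy = np.moveaxis(Fy, -1, 0)
    FxT = Fx.transpose(0, 2, 1)                # (C, W, H)
    FyT = Fy.transpose(0, 2, 1)
    scor_x = {"w": Fx @ FxT, "h": FxT @ Fx}    # (C, H, H) / (C, W, W)
    scor_y = {"w": Fy @ FyT, "h": FyT @ Fy}
    ccor   = {"w": Fx @ FyT, "h": FxT @ Fy}
    return scor_x, scor_y, ccor
```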
After SCor_x, SCor_y, and CCor are obtained, the correlation index scores (IdxS_1, IdxS_2, and IdxS_3) are calculated with the Euclidean distance (ED); since the height and the width of the image are not equal, two sets of correlation index scores are obtained, one for the height direction and one for the width direction (hereinafter, the two directions):

IdxS_1 = ED(SCor_x, CCor),  IdxS_2 = ED(SCor_y, CCor),  IdxS_3 = ED(SCor_x, SCor_y)

Finally, with the correlation index scores of the two directions, the cross mean square error (CMSE) gives the similarity in each direction:

SD_w = CMSE(IdxS_1, IdxS_2, IdxS_3) evaluated in the width direction, and SD_h likewise in the height direction, where sum denotes the summation operation inside the CMSE.
Finally, the similarity between Im1 and Im2 is the mean of the similarities in the two directions:

SD = (SD_w + SD_h) / 2
for better obtaining the Correlation between the target and the track, the invention can calculate the similarity as a Correlation Score (CS) by establishing the following formula, wherein the CS is limited between 0 and 1.
Figure RE-RE-GDA0003206252010000044
Wherein, SD 1 Similarity between two latest historical observation information on a certain track; SD 2 For a detected certain target and SD 1 The similarity of the latest historical observation information of the corresponding track. When SD 1 And SD 2 As the distance approaches, i.e., the correlation between the target and the trajectory increases, CS converges to 1, and conversely, CS becomes 0. And for the setting of the threshold value, the invention sets
Figure RE-RE-GDA0003206252010000046
0.6 when CS between target and track is greater than
Figure RE-RE-GDA0003206252010000047
It means that the object is likely to be an extension of the corresponding track, whereas the object is not an extension of the corresponding track.
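The patent gives the CS formula only behaviorally (bounded in [0, 1], converging to 1 as SD_1 and SD_2 approach and to 0 as they diverge), so the exponential form in the sketch below is one assumption with exactly that behavior, not the patent's own equation; the 0.6 threshold follows the text.

```python
# Association gate: accept a target as a track extension when CS > tau.
import math

def correlation_score(sd1, sd2):
    return math.exp(-abs(sd1 - sd2))   # assumed form, in (0, 1]

TAU = 0.6                              # threshold from the text

def extends_track(sd1, sd2):
    return correlation_score(sd1, sd2) > TAU

print(extends_track(0.42, 0.45))       # close similarities -> True
print(extends_track(0.10, 0.90))       # divergent          -> False
```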
On the basis of the fusion framework and the attention mechanism, the multi-target tracking strategy used by the method is described below:
referring to fig. 5, fig. 5 is a diagram illustrating a multi-target tracking strategy according to the present invention.
In the first stage (S1), the targets in the scene are detected with the YOLOv3 algorithm (Detector) and the existing tracks are pre-tracked with the modified KCF algorithm.
Subsequently, in the second stage (S2), the first association is completed with the IoU criterion:

IoU(Box_d, Box_t) = area(Box_d ∩ Box_t) / area(Box_d ∪ Box_t)

and the matching succeeds when IoU is greater than 0.5, otherwise it fails.
After S2, the matching states between targets and tracks are obtained, namely isolated matching (one-to-one), competitive matching (one-to-many or many-to-one), and failed matching; the different matching states are mainly handled in the third and fourth stages (S3 and S4), respectively:
1) When the matching state is isolated matching, the fourth stage (S4) is entered directly: LRM and track extension are performed on the target and the position predicted by the matched pre-tracking, and the attention template is updated.
2) When the matching state is competitive, the targets and tracks in the competition state are expressed as:

Com_obj = {obj, tra_1, tra_2, ..., tra_i},

Com_tra = {tra, obj_1, obj_2, ..., obj_j},

where i, j ∈ N; obj is an object detected by the detector (hereinafter, a target); tra is a target tracked by the tracker (hereinafter, a track); Com_obj and Com_tra respectively denote the competition states of a single target with multiple tracks and of a single track with multiple targets, hereinafter the two competition states.
Both competition states are resolved by the SSC, expressed as the following two cases:

CS_obj = {SSC(obj, tra_m)}, m ∈ [0, i]

CS_tra = {SSC(obj_n, tra)}, n ∈ [0, j]

where CS_obj and CS_tra respectively denote the relevance scores of the two competition states.
Once the CS values are computed, the best match is obtained from the following two equations:

BestMatch_obj = argmax_m CS_obj

BestMatch_tra = argmax_n CS_tra

where argmax is the index of the maximum value, i.e., the matching relationship between target and track; BestMatch_obj and BestMatch_tra denote the best matches of the two competition states.
If the above condition is met, the corresponding track and target are reclassified as isolated, proceed directly to S4, and the corresponding attention template is updated. Otherwise the state is judged as failed: tracks that fail matching are classified as inactive tracks (Int) and no longer take part in the tracking calculation of the current frame; targets that fail matching proceed to the subsequent SSC association.
3) Targets that fail matching are associated with the historically accumulated inactive tracks using the SSC. After a successful association, the corresponding track is directly extended with the detected target and the corresponding attention template is updated; if the association fails, a new track is initialized from the unmatched target. A track that fails remains in Int; if it spends more than 10 frames there, its extension is terminated and it makes no further contribution to subsequent calculations.
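The inactive-track bookkeeping of step 3) can be sketched as follows; the Track class and its field names are illustrative, while the 10-frame limit follows the text.

```python
# Age the inactive-track memory Int by one frame and retire stale tracks.
class Track:
    def __init__(self, tid):
        self.tid = tid
        self.frames_inactive = 0

def age_inactive(int_memory, max_inactive=10):
    survivors = []
    for tra in int_memory:
        tra.frames_inactive += 1
        if tra.frames_inactive <= max_inactive:
            survivors.append(tra)   # may still be revived by SSC association
        # else: track extension is terminated for good
    return survivors
```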

Claims (2)

1. The online multi-target tracking method based on the twin network structure and the attention mechanism is characterized by comprising the following steps of:
step 1: design of the detection-and-tracking fusion framework; the YOLOv3 algorithm and the KCF algorithm are improved and fused: first, the KCF algorithm is improved by discarding the initialization process and the iterative update of the correlation vector in the KCF model; on this basis, the optimized KCF model is fused with the YOLOv3 algorithm: the DBL module in YOLOv3 serves as the shared network (twin structure) that completes feature extraction for both the front stage of KCF and a detection branch, where the detection branch is the YOLOv3 framework with the DBL module removed;
step 2: the calculation process of the detection-and-tracking fusion framework is as follows:
firstly, the targets tracked in the previous frame and the previous frame image are taken as input, and a filter template for each target is computed through the improved KCF algorithm, each target corresponding to one filter template;
then, the tracked targets of the previous frame and the current frame image are taken as input, and the KCF algorithm with the computed filter templates obtains the bounding box Box_t of each corresponding target in the current frame image, which is the pre-tracking process; in parallel with pre-tracking, the detection branch detects the targets appearing in the current frame as Box_d, which is the per-frame detection process; since there are two kinds of bounding boxes, a linear regression model is designed to fuse Box_t and Box_d into the latest bounding box Box_new:

Box_new = α × Box_d + β × Box_t

wherein α and β are balance parameters; Box_t, Box_d, and Box_new are each expressed as [top, bottom, left, right], where [top, left] and [bottom, right] respectively represent the coordinates of the bounding box's upper-left and lower-right corners in the image;
the Box d Is disordered, and Box t Is ordered, therefore, by image attention mechanismExtracting the unique identification information of each target, and finally completing the association of the target and the estimation by utilizing the image structure similarity, comprising the following steps:
step 21: firstly, the online training of the attention template is completed; a convolution kernel of size M × M with channel numbers {3, C} serves as the attention template, followed by a fully connected layer, finally completing the binary classification of the images; N images are taken as training samples, of which k images are historical observation maps of a given track and are positive samples, and the remaining N − k images are historical observation maps of other tracks and are negative samples; during online training, the training image samples are unified into W × H images; note that each track corresponds to a different attention template;
step 22: after the training of the attention template is completed, the two image samples x and y are first unified into three-channel images of width W and height H, and feature extraction — convolution with the attention template, BatchNorm, and Sigmoid — yields F_x and F_y, both W × H × C matrices; subsequently, since the height and the width of the images are not equal, the autocorrelation matrices SCor and the cross-correlation matrix CCor along the height and width directions are obtained by the following two equations:

SCor_x = F_x ⊗ F_x^T,  SCor_y = F_y ⊗ F_y^T

CCor = F_x ⊗ F_y^T

wherein x and y are the two samples whose similarity is to be calculated; SCor_x and SCor_y are respectively the autocorrelation matrices of samples x and y; CCor is the cross-correlation matrix of samples x and y; F_x and F_y are respectively the feature maps of samples x and y; and F_x^T, F_y^T are respectively the transposes of those feature maps;
step 23: after the calculation of SCor and CCor is finished, the similarities SD_w and SD_h of the image in the height and width directions are calculated as follows; since the calculation in both directions is the same, SD_w is taken as the example:

1) the correlation index scores IdxS are calculated: having obtained SCor_x, SCor_y, and CCor of the image in the height and width directions, the correlation index scores are computed with the Euclidean distance:

IdxS_1 = ED(SCor_x, CCor),  IdxS_2 = ED(SCor_y, CCor),  IdxS_3 = ED(SCor_x, SCor_y)

wherein ED is the Euclidean distance;

2) SD_w is calculated from the index scores through the CMSE:

SD_w = CMSE(IdxS_1, IdxS_2, IdxS_3)

wherein sum is the summation operation inside the CMSE;

through the above steps SD_h is obtained in the same way, and the similarity between x and y is the mean SD of the similarities in the height and width directions of the image:

SD = (SD_w + SD_h) / 2
the target-track correlation score CS, limited between 0 and 1, is calculated from SD_1 and SD_2, wherein SD_1 is the similarity between the two latest historical observations on a track, and SD_2 is the similarity between a detected target and the latest historical observation of the track corresponding to SD_1; the closer SD_1 and SD_2 are — that is, the stronger the correlation between target and track — the more CS converges to 1, and otherwise to 0; the threshold τ is set to 0.6, and when the CS between a target and a track is greater than τ, the target is likely an extension of the corresponding track, whereas otherwise it is not.
2. The online multi-target tracking method based on the twin network structure and attention mechanism as claimed in claim 1, wherein the attention-based image similarity calculation is called SSC, and the multi-target tracking strategy mainly handles three matching states: isolated matching, competitive matching, and failed matching, comprising the following steps:
step 1: when the matching state is isolated matching, the linear regression model and track extension are applied to the target and the position predicted by the matched pre-tracking, and the attention template is updated;
step 2: when the matching state is competitive, the targets and tracks in the competition state are represented as:

Com_obj = {obj, tra_1, tra_2, ..., tra_i},

Com_tra = {tra, obj_1, obj_2, ..., obj_j},

where i, j ∈ N; obj is a target detected by the detector, hereinafter a target; tra is a target tracked by the tracker, hereinafter a track; Com_obj and Com_tra respectively represent the competition states of multiple tracks with a single target and multiple targets with a single track, hereinafter the two competition states;
the processing for two race conditions is mainly judged by the SSC, expressed as the following two cases:
CS obj ={SSC(obj,tra m )},m∈[0,i],
CS tra ={SSC(obj n ,tra)},n∈[0,j],
wherein CS obj And CS tra A correlation score representing the two competition states, respectively;
once the CS values are computed, the best match is obtained by the following two equations:

BestMatch_obj = argmax_m CS_obj,

BestMatch_tra = argmax_n CS_tra,

wherein argmax is the index of the maximum value, i.e., the matching relationship between target and track; BestMatch_obj and BestMatch_tra denote the best matches of the two competition states;
if the above condition is met, the track is extended by its corresponding target and the corresponding attention template is updated; otherwise the state is judged as failed; tracks that fail matching are classified into the inactive-track memory and no longer take part in the tracking calculation of the current frame;
step 3: targets that fail matching are associated with the historically accumulated inactive tracks using the SSC; after a successful association, track extension and attention template updating are performed; otherwise, if the association fails, a new track is initialized from the unmatched target; a track that fails can still be placed in the inactive-track memory, and if its storage time there exceeds a certain number of frames, the track becomes the final track of the corresponding target and takes no further part in judgment or extension.
CN202110545060.1A 2021-05-19 2021-05-19 On-line multi-target tracking method based on twin network structure and attention mechanism Active CN113379793B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110545060.1A CN113379793B (en) 2021-05-19 2021-05-19 On-line multi-target tracking method based on twin network structure and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110545060.1A CN113379793B (en) 2021-05-19 2021-05-19 On-line multi-target tracking method based on twin network structure and attention mechanism

Publications (2)

Publication Number Publication Date
CN113379793A CN113379793A (en) 2021-09-10
CN113379793B true CN113379793B (en) 2022-08-12

Family

ID=77571212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110545060.1A Active CN113379793B (en) 2021-05-19 2021-05-19 On-line multi-target tracking method based on twin network structure and attention mechanism

Country Status (1)

Country Link
CN (1) CN113379793B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240996B (en) * 2021-11-16 2024-05-07 灵译脑科技(上海)有限公司 Multi-target tracking method based on target motion prediction
CN117495917B (en) * 2024-01-03 2024-03-26 山东科技大学 Multi-target tracking method based on JDE multi-task network model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639551A (en) * 2020-05-12 2020-09-08 华中科技大学 Online multi-target tracking method and system based on twin network and long-short term clues
CN112184770A (en) * 2020-09-28 2021-01-05 中国电子科技集团公司第五十四研究所 Target tracking method based on YOLOv3 and improved KCF
CN112560656A (en) * 2020-12-11 2021-03-26 成都东方天呈智能科技有限公司 Pedestrian multi-target tracking method combining attention machine system and end-to-end training
WO2021057504A1 (en) * 2019-09-29 2021-04-01 Zhejiang Dahua Technology Co., Ltd. Systems and methods for traffic monitoring
CN112734809A (en) * 2021-01-21 2021-04-30 高新兴科技集团股份有限公司 Online multi-pedestrian tracking method and device based on Deep-Sort tracking framework
CN113408584A (en) * 2021-05-19 2021-09-17 成都理工大学 RGB-D multi-modal feature fusion 3D target detection method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284847B (en) * 2017-07-20 2020-12-25 杭州海康威视数字技术股份有限公司 Machine learning and object finding method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021057504A1 (en) * 2019-09-29 2021-04-01 Zhejiang Dahua Technology Co., Ltd. Systems and methods for traffic monitoring
CN111639551A (en) * 2020-05-12 2020-09-08 华中科技大学 Online multi-target tracking method and system based on twin network and long-short term clues
CN112184770A (en) * 2020-09-28 2021-01-05 中国电子科技集团公司第五十四研究所 Target tracking method based on YOLOv3 and improved KCF
CN112560656A (en) * 2020-12-11 2021-03-26 成都东方天呈智能科技有限公司 Pedestrian multi-target tracking method combining attention machine system and end-to-end training
CN112734809A (en) * 2021-01-21 2021-04-30 高新兴科技集团股份有限公司 Online multi-pedestrian tracking method and device based on Deep-Sort tracking framework
CN113408584A (en) * 2021-05-19 2021-09-17 成都理工大学 RGB-D multi-modal feature fusion 3D target detection method

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Attention-Based Siamese Region Proposals Network for Visual Tracking; Fan Wang et al.; IEEE; 2020-04-29; pp. 86595-86607 *
Detector–tracker Integration Framework and Attention Mechanism for Multi–object Tracking; Chunjiang Li et al.; Neurocomputing; 2021-11-13; vol. 464; pp. 450-461 *
Non-local attention association scheme for online multi-object tracking; H. Wang et al.; Image and Vision Computing; 2020-07-31; p. 103983 *
Online adaptive Siamese network tracking algorithm based on attention mechanism; Dong Jifu et al.; Laser & Optoelectronics Progress; 2020-01-25; vol. 57, no. 2; pp. 1-3 *
A survey of visual multi-object tracking algorithms based on deep learning; Zhang Yao; Computer Engineering and Applications; 2021-04-13; vol. 57, no. 13; pp. 55-66 *
Design and implementation of a real-time pedestrian dynamic monitoring system; Li Suiwei et al.; Information Technology; 2020-05-21; vol. 44, no. 5; pp. 15-20 *

Also Published As

Publication number Publication date
CN113379793A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN109508654B (en) Face analysis method and system fusing multitask and multi-scale convolutional neural network
CN109636829B (en) Multi-target tracking method based on semantic information and scene information
CN113379793B (en) On-line multi-target tracking method based on twin network structure and attention mechanism
CN110288627B (en) Online multi-target tracking method based on deep learning and data association
CN106780557B (en) Moving object tracking method based on optical flow method and key point features
CN108520530B (en) Target tracking method based on long-time and short-time memory network
Wang et al. A region based stereo matching algorithm using cooperative optimization
CN108573496B (en) Multi-target tracking method based on LSTM network and deep reinforcement learning
CN111127519B (en) Dual-model fusion target tracking control system and method thereof
CN111696133B (en) Real-time target tracking method and system
Ji et al. Multi-perspective tracking for intelligent vehicle
CN114913206A (en) Multi-target tracking method and system based on multi-mode fusion
Yang et al. Surface vehicle detection and tracking with deep learning and appearance feature
Sun et al. Online multiple object tracking based on fusing global and partial features
Zhu et al. A review of 6d object pose estimation
CN113379795B (en) Multi-target tracking and segmentation method based on conditional convolution and optical flow characteristics
CN112818771A (en) Multi-target tracking algorithm based on feature aggregation
CN110363791B (en) Online multi-target tracking method fusing single-target tracking result
Tang et al. Place recognition using line-junction-lines in urban environments
CN115937249A (en) Twin network-based multi-prediction output aligned target tracking method and device
JP5293429B2 (en) Moving object detection apparatus and moving object detection method
CN113673540A (en) Target detection method based on positioning information guidance
Li et al. Detector–tracker integration framework and attention mechanism for multi–object tracking
Gao et al. Lightweight high-precision pedestrian tracking algorithm in complex occlusion scenarios.
Xue et al. Infrared target tracking algorithm based on motion features and contour features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant