CN116977367A - Campus multi-target tracking method based on Transformer and Kalman filtering - Google Patents

Campus multi-target tracking method based on Transformer and Kalman filtering

Info

Publication number
CN116977367A
Authority
CN
China
Prior art keywords
target tracking
network
network model
target
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310868814.6A
Other languages
Chinese (zh)
Inventor
马苗
李海洋
裴炤
姚超
武杰
杨楷芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN202310868814.6A
Publication of CN116977367A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

A campus multi-target tracking method based on a Transformer and Kalman filtering comprises: fusing a convolutional neural network and a Transformer model on the PyTorch framework and using a Kalman filter as a nonlinear tracker for prediction to generate a multi-target tracking network model; training the multi-target tracking network model; and testing the multi-target tracking network model. The invention combines a convolutional neural network, a Transformer model and Kalman filtering, improves the accuracy of the multi-target tracking task, and alleviates problems such as target loss in target-dense scenes such as campuses. The method can be used for vehicle-road coordination, intelligent transportation, intelligent security, campus safety, video monitoring and other important fields in which multi-target tracking technology is widely applied.

Description

Campus multi-target tracking method based on Transformer and Kalman filtering
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a campus multi-target tracking method based on a Transformer and Kalman filtering.
Background
Multi-target tracking technology is now applied in many aspects of production, daily life and military affairs, such as video monitoring, vehicle-road coordination, intelligent transportation, intelligent security and autonomous driving. Visual multi-target tracking is an important component of computer vision, but multi-target tracking against complex backgrounds remains a very challenging task owing to occlusion, illumination changes, deformation and other complex scene factors.
In visual multi-target tracking, the positions of tracked objects vary over a large range and the number of tracked objects is not fixed. Moreover, the multi-target tracking problem typically tracks multiple objects of a given type, which share a degree of apparent contour similarity. Besides factors such as object deformation and background interference, the following problems generally need to be solved: automatic initialization and automatic termination of tracked targets, i.e. the appearance of new targets and the disappearance of old ones; and motion prediction and similarity judgment for tracked targets, together with interactions between occluded targets, to which traditional methods cannot respond quickly, leading to large tracking errors.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a campus multi-target tracking method based on a Transformer and Kalman filtering that is simple and has high tracking accuracy.
The technical scheme adopted to solve this technical problem is as follows: a campus multi-target tracking method based on a Transformer and Kalman filtering, comprising the following steps:
step 1, constructing a multi-target tracking network model
Based on the PyTorch framework, fusing a convolutional neural network and a Transformer model, and using a Kalman filter as a nonlinear tracker for prediction, to generate a multi-target tracking network model;
the multi-target tracking network model is formed by sequentially connecting a backbone network, a Transformer Encoder module, a Transformer Decoder module and a prediction head feedforward network, wherein the backbone network is a MobileNetV2 network used to extract the feature map of an input picture to obtain a basic feature map; the Transformer Encoder module is used to add position codes to the basic feature map output by the backbone network and extract global features from the basic feature map through a self-attention mechanism to obtain a global feature map; the Transformer Decoder module is used to reconstruct the global feature map output by the Transformer Encoder module and learn features, obtaining token predictions after attention and mapping; the prediction head feedforward network is used for classification and regression;
the method of predicting with the Kalman filter as the nonlinear tracker comprises the following steps:
a. according to the result output by the prediction head feedforward network, obtaining the distance between targets by weighting the cosine distance of the appearance features and the covariance distance between the predicted state and the measured state;
the distance c_{i,j} between targets is
c_{i,j} = λd^{(1)} + (1 − λ)d^{(2)}
d^{(1)} = (d − y)^T S^{−1} (d − y)
d^{(2)} = min{1 − cos α}
where λ is a hyper-parameter controlling the influence of each term on the merging cost, cos α is the cosine similarity, d^{(2)} is the cosine distance, d is the measurement distribution, y is the prediction distribution, and S is the covariance matrix between the measurement distribution and the prediction distribution;
b. obtaining a prediction result with the Kalman filter according to the following formulas (1) and (2):
x_z = F x_{z−1} + B_z u_z   (1)
P_z = F P_{z−1} F^T + Q   (2)
where x_z is the predicted state at the next moment, x_{z−1} is the state at the current moment, F is the prediction matrix, B_z is the control matrix, u_z is the control vector, P_z is the prediction estimation error at the current moment, P_{z−1} is the prediction estimation error at the previous moment, F^T is the transpose of the prediction matrix F, and Q is the covariance of external influences;
c. substituting the distance matrix between targets into the cascade matching method and an IoU matching method to match the prediction result of the Kalman filter against the actual result; after a successful match, the state of the matched target is updated, while a target that fails to match is established as a new target for the next round of comparison matching;
d. updating the predicted track, and predicting the track of the next frame by using a Kalman filter;
step 2, training a multi-target tracking network model;
step 3, detecting the multi-target tracking network model
Setting the region confidence threshold of the multi-target tracking network model to 0.05, with other parameters at network defaults, inputting the images of the test set into the trained multi-target tracking network model, and outputting the multi-target tracking results.
As a preferred technical solution, the method for adding position codes in the Transformer Encoder module in step 1 is as follows: first the basic feature map is divided into blocks, and horizontal position codes and vertical position codes are produced according to the position of each picture block in the original map, the horizontal position code consisting of sine values obtained at even positions in the horizontal direction of the block according to formula (3) and cosine values obtained at odd positions according to formula (4):
PE(p_β, β)_s = sin(p_β / 10000^{2β/d_m})   (3)
PE(p_β, β)_c = cos(p_β / 10000^{2β/d_m})   (4)
where PE(p_β, β)_s is the sine value, PE(p_β, β)_c is the cosine value, p_β is the position of the picture block in the basic feature map, β is the index of the picture block dimension, and d_m is the output dimension of the Transformer Encoder module;
the horizontal and vertical position codes are then concatenated into a vector with the same dimension as the output feature dimension of the convolutional neural network, which is used as the position code of the basic feature map.
As a preferred technical solution, in step 1 the prediction head feedforward network comprises a ReLU activation function and three linear layers with hidden layers, and the output of the last layer R of the prediction head feedforward network can be taken as the conditional probability of each class, namely:
J = softmax(Z^{(R)})
where Z^{(R)} is the net input to layer R and J is the activity value of layer R.
As a preferred technical solution, the specific steps of training the multi-target tracking network model in step 2 are as follows:
Step 2.1. Setting training parameters
Extracting one frame of image from the MOT17-03-SDP data set every 0.1 second, taking the first 450 images as the training set and the last 50 images as the test set, uniformly resizing the images in the training and test sets to 1920×1080, setting the initial learning rate lr to 0.0002, and optimizing the multi-target tracking network model with the SGD (stochastic gradient descent) optimizer;
Step 2.2. Network model parameter initialization
Pre-training the backbone network on the ImageNet data set to obtain weights and biases, which are used as the initial weights and biases of the backbone network;
Step 2.3. Training the multi-target tracking network model
Inputting all images of the training set into the multi-target tracking network model for forward propagation and computing the loss function until the loss converges and training ends, obtaining the trained multi-target tracking network model.
The beneficial effects of the invention are as follows:
the invention combines the convolutional neural network, the transducer model and the Kalman filtering, improves the precision of the multi-target tracking task, and solves the problems of multi-target tracking target loss and the like in a target dense scene such as a campus. The method can be used for vehicle-road coordination, intelligent traffic, intelligent security, campus security, video monitoring and the like, and the multi-target tracking technology is widely applied in a plurality of important fields.
Drawings
Fig. 1 is a schematic flow chart of the campus multi-target tracking method based on a Transformer and Kalman filtering.
FIG. 2 is a schematic diagram of the architecture of the multi-objective tracking network model of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, but the present invention is not limited to the following embodiments.
In fig. 1, the campus multi-target tracking method based on a Transformer and Kalman filtering of this embodiment includes the following steps:
step 1, constructing a multi-target tracking network model
Based on the PyTorch framework, a convolutional neural network and a Transformer model are fused, and a Kalman filter is used as a nonlinear tracker for prediction, generating a multi-target tracking network model.
The multi-target tracking network model is composed of a backbone network, a Transformer Encoder module, a Transformer Decoder module and a prediction head feedforward network connected sequentially in series, as shown in fig. 2.
The backbone network is a MobileNetV2 network used to extract the feature map of the input picture to obtain a basic feature map.
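As a concrete illustration of this stage, the following is a minimal sketch assuming torchvision's MobileNetV2 with ImageNet weights; the patent does not specify which feature stage feeds the encoder, so taking the full `features` stack is an assumption:

```python
# Minimal backbone sketch: extract the basic feature map with MobileNetV2.
import torch
from torchvision.models import mobilenet_v2

backbone = mobilenet_v2(weights="IMAGENET1K_V1").features  # ImageNet pre-training, cf. step 2.2
backbone.eval()

frame = torch.randn(1, 3, 1080, 1920)           # one 1920x1080 RGB frame (N, C, H, W)
with torch.no_grad():
    base_feature_map = backbone(frame)          # basic feature map for the Transformer encoder
print(base_feature_map.shape)                   # e.g. torch.Size([1, 1280, 34, 60])
```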
Transformer Encoder module is configured to add position codes to the basic feature map output by the backbone network, and perform global feature extraction on the basic feature map through a self-attention mechanism, so as to obtain a global feature map.
The method for adding position codes in the Transformer Encoder module is as follows: first the basic feature map is divided into blocks, and horizontal position codes and vertical position codes are produced according to the position of each picture block in the original map, the horizontal position code consisting of sine values obtained at even positions in the horizontal direction of the block according to formula (1) and cosine values obtained at odd positions according to formula (2):
PE(p_β, β)_s = sin(p_β / 10000^{2β/d_m})   (1)
PE(p_β, β)_c = cos(p_β / 10000^{2β/d_m})   (2)
where PE(p_β, β)_s is the sine value, PE(p_β, β)_c is the cosine value, p_β is the position of the picture block in the basic feature map, β is the index of the picture block dimension, and d_m is the output dimension of the Transformer Encoder module;
the horizontal and vertical position codes are then concatenated into a vector with the same dimension as the output feature dimension of the convolutional neural network, which is used as the position code of the basic feature map.
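The following is a minimal sketch of such a 2D position code, assuming the standard sinusoidal form (base 10000) behind formulas (1) and (2): sine at even indices, cosine at odd ones, with the vertical and horizontal codes concatenated:

```python
# 2D sinusoidal position code, sketched under the standard-Transformer assumption.
import torch

def axis_encoding(num_positions: int, dim: int) -> torch.Tensor:
    """Sinusoidal code along one axis: (num_positions, dim), dim even."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    beta = torch.arange(0, dim, 2, dtype=torch.float32)
    freq = pos / (10000 ** (beta / dim))
    enc = torch.zeros(num_positions, dim)
    enc[:, 0::2] = torch.sin(freq)   # even positions: sine values
    enc[:, 1::2] = torch.cos(freq)   # odd positions: cosine values
    return enc

def position_code(h: int, w: int, d_model: int) -> torch.Tensor:
    """Concatenate the vertical and horizontal codes to d_model channels."""
    half = d_model // 2
    row = axis_encoding(h, half).unsqueeze(1).expand(h, w, half)
    col = axis_encoding(w, half).unsqueeze(0).expand(h, w, half)
    return torch.cat([row, col], dim=-1)        # (h, w, d_model)

pe = position_code(34, 60, 256)                  # matches a 34x60 feature map
```

The resulting (h, w, d_model) tensor can be flattened and added to the tokenized feature map before the encoder's self-attention.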
The Transformer Decoder module is used to reconstruct the global feature map output by the Transformer Encoder module and learn features, obtaining token predictions after attention and mapping.
The prediction head feedforward network comprises a ReLU activation function and three linear layers with hidden layers: the feedforward network with the ReLU activation function predicts the normalized center coordinates, heights and widths, the linear layer predicts class labels through a softmax function, and the output of the last layer R of the prediction head feedforward network can be taken as the conditional probability of each class, namely:
J = softmax(Z^{(R)})
where Z^{(R)} is the net input to layer R and J is the activity value of layer R.
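A minimal sketch of such a head, assuming a DETR-style design (a 3-layer MLP with ReLU for the normalized box and a linear projection with softmax for classes); the layer sizes are illustrative assumptions:

```python
# Prediction head sketch: box regression branch plus J = softmax(Z^(R)).
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, d_model: int = 256, num_classes: int = 2):
        super().__init__()
        self.box_mlp = nn.Sequential(            # three linear layers with hidden layers
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4),
        )
        self.class_proj = nn.Linear(d_model, num_classes)

    def forward(self, tokens: torch.Tensor):
        boxes = self.box_mlp(tokens).sigmoid()   # normalized center/size in [0, 1]
        logits = self.class_proj(tokens)         # Z^(R), the net input of the last layer
        return boxes, logits.softmax(dim=-1)     # J = softmax(Z^(R))

head = PredictionHead()
boxes, probs = head(torch.randn(1, 100, 256))    # 100 decoder tokens (object queries)
```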
The method of predicting with the Kalman filter as the nonlinear tracker comprises the following steps:
a. according to the result output by the prediction head feedforward network, obtaining the distance between targets by weighting the cosine distance of the appearance features and the covariance distance between the predicted state and the measured state;
the distance c_{i,j} between targets is
c_{i,j} = λd^{(1)} + (1 − λ)d^{(2)}
d^{(1)} = (d − y)^T S^{−1} (d − y)
d^{(2)} = min{1 − cos α}
where λ is a hyper-parameter controlling the influence of each term on the merging cost, cos α is the cosine similarity, d^{(2)} is the cosine distance, d is the measurement distribution, y is the prediction distribution, and S is the covariance matrix between the measurement distribution and the prediction distribution;
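Before moving to step b, the weighted cost above can be illustrated with a minimal NumPy sketch; λ = 0.5 and the feature dimensions are illustrative assumptions:

```python
# Association cost c_ij = λ·d(1) + (1-λ)·d(2): Mahalanobis plus cosine distance.
import numpy as np

def association_cost(d, y, S, track_feats, det_feat, lam=0.5):
    diff = d - y
    d1 = float(diff.T @ np.linalg.inv(S) @ diff)           # d(1): covariance (Mahalanobis) distance
    feats = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    cos = feats @ (det_feat / np.linalg.norm(det_feat))    # cosine similarities cos α
    d2 = float(np.min(1.0 - cos))                          # d(2) = min{1 - cos α}
    return lam * d1 + (1.0 - lam) * d2

cost = association_cost(
    d=np.array([5.0, 3.0, 1.2, 0.5]),       # measured state (e.g. cx, cy, aspect, h)
    y=np.array([4.8, 3.1, 1.1, 0.5]),       # predicted state
    S=np.eye(4) * 0.1,                       # covariance between measurement and prediction
    track_feats=np.random.randn(10, 128),    # gallery of past appearance features
    det_feat=np.random.randn(128),
)
```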
b. obtaining a prediction result with the Kalman filter according to the following formulas (3) and (4):
x_z = F x_{z−1} + B_z u_z   (3)
P_z = F P_{z−1} F^T + Q   (4)
where x_z is the predicted state at the next moment, x_{z−1} is the state at the current moment, F is the prediction matrix, B_z is the control matrix, u_z is the control vector, P_z is the prediction estimation error at the current moment, P_{z−1} is the prediction estimation error at the previous moment, F^T is the transpose of the prediction matrix F, and Q is the covariance of external influences;
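A minimal sketch of this predict step follows, written for an assumed constant-velocity state (box parameters plus their velocities); F, Q and the state layout are illustrative:

```python
# Kalman predict step: x_z = F x_{z-1} + B_z u_z and P_z = F P_{z-1} F^T + Q.
import numpy as np

def kalman_predict(x, P, F, Q, B=None, u=None):
    x_pred = F @ x + (B @ u if B is not None and u is not None else 0.0)
    P_pred = F @ P @ F.T + Q                   # propagate the prediction estimation error
    return x_pred, P_pred

dim = 4                                        # (cx, cy, aspect, h)
F = np.eye(2 * dim)
F[:dim, dim:] = np.eye(dim)                    # position += velocity (constant-velocity model)
Q = np.eye(2 * dim) * 1e-2                     # covariance of external influences
x, P = np.zeros(2 * dim), np.eye(2 * dim)
x_next, P_next = kalman_predict(x, P, F, Q)
```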
c. substituting the distance matrix between targets into the cascade matching method and an IoU matching method to match the prediction result of the Kalman filter against the actual result; after a successful match, the state of the matched target is updated, while a target that fails to match is established as a new target for the next round of comparison matching (a sketch of this matching step follows the list);
d. updating the predicted track, and predicting the track of the next frame by using a Kalman filter;
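Steps c and d can be illustrated with the following sketch, which assumes SciPy's Hungarian solver over the cost matrix and an assumed gating threshold; unmatched detections become new targets:

```python
# Matching sketch: assign tracks to detections over the weighted cost matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(cost_matrix: np.ndarray, gate: float = 0.7):
    rows, cols = linear_sum_assignment(cost_matrix)
    matches, new_targets = [], set(range(cost_matrix.shape[1]))
    for r, c in zip(rows, cols):
        if cost_matrix[r, c] <= gate:          # accept only sufficiently close pairs
            matches.append((r, c))             # update this track's state
            new_targets.discard(c)
    return matches, sorted(new_targets)        # leftovers are established as new targets

matches, new_targets = match(np.random.rand(5, 7))
```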
step 2, training a multi-target tracking network model
Step 2.1. Setting training parameters
Extracting one frame of image from the MOT17-03-SDP data set every 0.1 second, taking the first 450 images as the training set and the last 50 images as the test set, uniformly resizing the images in the training and test sets to 1920×1080, setting the initial learning rate lr to 0.0002, and optimizing the multi-target tracking network model with the SGD (stochastic gradient descent) optimizer;
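A minimal sketch of this setup, with hypothetical frame paths and a stand-in model in place of the tracking network:

```python
# Training setup sketch for step 2.1: 450/50 split, 1920x1080 inputs, SGD at lr=0.0002.
import torch
from torchvision import transforms

frame_paths = [f"MOT17-03-SDP/frame_{i:06d}.jpg" for i in range(500)]  # hypothetical names
train_paths, test_paths = frame_paths[:450], frame_paths[450:]

preprocess = transforms.Compose([
    transforms.Resize((1080, 1920)),   # uniform 1920x1080 inputs (height, width)
    transforms.ToTensor(),
])

model = torch.nn.Linear(8, 8)          # stand-in for the multi-target tracking network
optimizer = torch.optim.SGD(model.parameters(), lr=2e-4)
```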
step 2.2. Network model parameter initialization
Pre-training the backbone network on the ImageNet data set to obtain weights and biases, which are used as the initial weights and biases of the backbone network;
step 2.3 training a Multi-target tracking network model
Inputting all images of the training set into the multi-target tracking network model for forward propagation and computing the loss function; the prediction head feedforward network predicts the normalized center coordinates, the linear layer obtains the predicted class labels through the softmax function, and the next epoch of training is performed until the loss converges and training ends, giving the trained multi-target tracking network model;
the softmax function used is a margin-based softmax whose margin function is:
f(m, θ_{w,x}) = cos(m_1 θ_{w,x} + m_3) − m_2
where L is the loss, k is the number of categories, x is the vector to be classified, s is the scale hyper-parameter, w is the weight vector, p_y is the predicted posterior probability, the g(·) function is used to mine difficult samples, I_k is an indicator function that dynamically specifies whether a sample is misclassified, θ_{w,x} is the angle between w and x, f(m, θ_{w,x}) is the margin function with m_1 ≥ 1, m_2 > 0 and m_3 > 0, and h(t, θ_{w,x}, I_k) is a weighting function for misclassified samples, where t is a preset hyper-parameter.
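A minimal sketch of the margin function above; the values of m_1, m_2, m_3 and the scale s are illustrative assumptions within the stated constraints (m_1 ≥ 1, m_2 > 0, m_3 > 0):

```python
# Margin function f(m, θ) = cos(m1·θ + m3) − m2 for the margin-based softmax.
import torch

def margin_fn(theta: torch.Tensor, m1: float = 1.35, m2: float = 0.1, m3: float = 0.2) -> torch.Tensor:
    """theta: angle between the weight vector w and the feature x, in radians."""
    return torch.cos(m1 * theta + m3) - m2

s = 30.0                                   # scale hyper-parameter s (assumed value)
theta = torch.tensor([0.3, 1.0])           # example angles for two samples
scaled_logits = s * margin_fn(theta)       # fed into softmax in place of s·cos(theta)
```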
Step 3, detecting the multi-target tracking network model
Setting the region confidence threshold of the multi-target tracking network model to 0.05, with other parameters at network defaults, inputting the images of the test set into the trained multi-target tracking network model, and outputting the multi-target tracking results.
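A minimal sketch of this test pass, assuming the prediction-head interface sketched earlier:

```python
# Step 3 sketch: keep predictions whose region confidence exceeds 0.05.
import torch

CONF_THRESHOLD = 0.05

@torch.no_grad()
def detect(model, image: torch.Tensor):
    boxes, probs = model(image)            # (1, num_queries, 4), (1, num_queries, classes)
    conf, labels = probs.max(dim=-1)       # per-query confidence and class
    keep = conf > CONF_THRESHOLD           # region confidence threshold of 0.05
    return boxes[keep], labels[keep], conf[keep]
```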

Claims (3)

1. A campus multi-target tracking method based on a Transformer and Kalman filtering, characterized by comprising the following steps:
step 1, constructing a multi-target tracking network model
Based on the PyTorch framework, fusing a convolutional neural network and a Transformer model, and using a Kalman filter as a nonlinear tracker for prediction, to generate a multi-target tracking network model;
the multi-target tracking network model is formed by sequentially connecting in series a backbone network, a Transformer Encoder module, a Transformer Decoder module and a prediction head feedforward network, wherein the backbone network is a MobileNetV2 network used to extract the feature map of an input picture to obtain a basic feature map; the Transformer Encoder module is used to add position codes to the basic feature map output by the backbone network and extract global features from the basic feature map through a self-attention mechanism to obtain a global feature map; the Transformer Decoder module is used to reconstruct the global feature map output by the Transformer Encoder module and learn features, obtaining token predictions after attention and mapping; the prediction head feedforward network is used for classification and regression;
the method of predicting with the Kalman filter as the nonlinear tracker comprises the following steps:
a. according to the result output by the prediction head feedforward network, obtaining the distance between targets by weighting the cosine distance of the appearance features and the covariance distance between the predicted state and the measured state;
the distance c_{i,j} between targets is
c_{i,j} = λd^{(1)} + (1 − λ)d^{(2)}
d^{(1)} = (d − y)^T S^{−1} (d − y)
d^{(2)} = min{1 − cos α}
where λ is a hyper-parameter controlling the influence of each term on the merging cost, cos α is the cosine similarity, d^{(2)} is the cosine distance, d is the measurement distribution, y is the prediction distribution, and S is the covariance matrix between the measurement distribution and the prediction distribution;
b. obtaining a prediction result with the Kalman filter according to the following formulas (1) and (2):
x_z = F x_{z−1} + B_z u_z   (1)
P_z = F P_{z−1} F^T + Q   (2)
where x_z is the predicted state at the next moment, x_{z−1} is the state at the current moment, F is the prediction matrix, B_z is the control matrix, u_z is the control vector, P_z is the prediction estimation error at the current moment, P_{z−1} is the prediction estimation error at the previous moment, F^T is the transpose of the prediction matrix F, and Q is the covariance of external influences;
c. substituting the distance matrix between targets into the cascade matching method and an IoU matching method to compare and match the prediction result of the Kalman filter with the actual result; after a successful match, the state of the matched target is updated, while a target that fails to match is established as a new target for the next round of comparison matching;
d. updating the predicted track, and predicting the track of the next frame by using a Kalman filter;
step 2, training a multi-target tracking network model;
step 2.1. Setting training parameters
Extracting one frame of image from the MOT17-03-SDP data set every 0.1 second, dividing the extracted images into a training set and a test set at a ratio of 9:1, uniformly resizing the images in the training and test sets to 1920×1080, setting the initial learning rate lr to 0.0002, and optimizing the multi-target tracking network model with the SGD (stochastic gradient descent) optimizer;
step 2.2. Network model parameter initialization
Pre-training the backbone network on the ImageNet data set to obtain weights and biases, which are used as the initial weights and biases of the backbone network;
step 2.3 training a Multi-target tracking network model
Inputting all images of the training set into the multi-target tracking network model for forward propagation and computing the loss function until the loss converges and training ends, obtaining the trained multi-target tracking network model;
step 3, detecting the multi-target tracking network model
Setting the region confidence threshold of the multi-target tracking network model to 0.05, with other parameters at network defaults, inputting the images of the test set into the trained multi-target tracking network model, and outputting the multi-target tracking results.
2. The campus multi-target tracking method based on a Transformer and Kalman filtering according to claim 1, wherein the method for adding position codes in the Transformer Encoder module in step 1 is as follows: first the basic feature map is divided into blocks, and horizontal position codes and vertical position codes are produced according to the position of each picture block in the original map, the horizontal position code consisting of sine values obtained at even positions in the horizontal direction of the block according to formula (3) and cosine values obtained at odd positions according to formula (4):
PE(p_β, β)_s = sin(p_β / 10000^{2β/d_m})   (3)
PE(p_β, β)_c = cos(p_β / 10000^{2β/d_m})   (4)
where PE(p_β, β)_s is the sine value, PE(p_β, β)_c is the cosine value, p_β is the position of the picture block in the basic feature map, β is the index of the picture block dimension, and d_m is the output dimension of the Transformer Encoder module;
the horizontal and vertical position codes are then concatenated into a vector with the same dimension as the output feature dimension of the convolutional neural network, which is used as the position code of the basic feature map.
3. The campus multi-target tracking method based on a Transformer and Kalman filtering according to claim 1, wherein in step 1 the prediction head feedforward network comprises a ReLU activation function and three linear layers with hidden layers, and the output of the last layer R of the prediction head feedforward network can be taken as the conditional probability of each class, namely:
J = softmax(Z^{(R)})
where Z^{(R)} is the net input to layer R and J is the activity value of layer R.
CN202310868814.6A 2023-07-14 2023-07-14 Campus multi-target tracking method based on Transformer and Kalman filtering Pending CN116977367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310868814.6A CN116977367A (en) 2023-07-14 Campus multi-target tracking method based on Transformer and Kalman filtering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310868814.6A CN116977367A (en) 2023-07-14 Campus multi-target tracking method based on Transformer and Kalman filtering

Publications (1)

Publication Number Publication Date
CN116977367A 2023-10-31

Family

ID=88477638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310868814.6A Pending CN116977367A (en) 2023-07-14 2023-07-14 Campus multi-target tracking method based on transform and Kalman filtering

Country Status (1)

Country Link
CN (1) CN116977367A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117241133A (en) * 2023-11-13 2023-12-15 武汉益模科技股份有限公司 Visual work reporting method and system for multi-task simultaneous operation based on non-fixed position
CN117788511A (en) * 2023-12-26 2024-03-29 兰州理工大学 Multi-expansion target tracking method based on deep neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10269125B1 (en) * 2018-10-05 2019-04-23 StradVision, Inc. Method for tracking object by using convolutional neural network including tracking network and computing device using the same
CN113674328A (en) * 2021-07-14 2021-11-19 南京邮电大学 Multi-target vehicle tracking method
CN114529799A (en) * 2022-01-06 2022-05-24 浙江工业大学 Aircraft multi-target tracking method based on improved YOLOV5 algorithm
CN115861884A (en) * 2022-12-06 2023-03-28 中南大学 Video multi-target tracking method, system, device and medium in complex scene
CN116245916A (en) * 2023-05-11 2023-06-09 中国人民解放军国防科技大学 Unmanned ship-oriented infrared ship target tracking method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10269125B1 (en) * 2018-10-05 2019-04-23 StradVision, Inc. Method for tracking object by using convolutional neural network including tracking network and computing device using the same
CN113674328A (en) * 2021-07-14 2021-11-19 南京邮电大学 Multi-target vehicle tracking method
CN114529799A (en) * 2022-01-06 2022-05-24 浙江工业大学 Aircraft multi-target tracking method based on improved YOLOV5 algorithm
CN115861884A (en) * 2022-12-06 2023-03-28 中南大学 Video multi-target tracking method, system, device and medium in complex scene
CN116245916A (en) * 2023-05-11 2023-06-09 中国人民解放军国防科技大学 Unmanned ship-oriented infrared ship target tracking method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Li et al. (王利等), "Transformer multi-object tracking method based on dual decoders" (基于双解码器的Transformer多目标跟踪方法), Journal of Computer Applications (计算机应用), vol. 43, no. 6, 10 June 2023 (2023-06-10), pages 2-3 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117241133A (en) * 2023-11-13 2023-12-15 武汉益模科技股份有限公司 Visual work reporting method and system for multi-task simultaneous operation based on non-fixed position
CN117241133B (en) * 2023-11-13 2024-02-06 武汉益模科技股份有限公司 Visual work reporting method and system for multi-task simultaneous operation based on non-fixed position
CN117788511A (en) * 2023-12-26 2024-03-29 兰州理工大学 Multi-expansion target tracking method based on deep neural network

Similar Documents

Publication Publication Date Title
CN110070074B (en) Method for constructing pedestrian detection model
CN116977367A (en) Campus multi-target tracking method based on Transformer and Kalman filtering
CN108573496B (en) Multi-target tracking method based on LSTM network and deep reinforcement learning
CN108470355B (en) Target tracking method fusing convolution network characteristics and discriminant correlation filter
CN114972418B (en) Maneuvering multi-target tracking method based on combination of kernel adaptive filtering and YOLOX detection
CN112836640B (en) Single-camera multi-target pedestrian tracking method
CN111161315B (en) Multi-target tracking method and system based on graph neural network
CN111582349B (en) Improved target tracking algorithm based on YOLOv3 and kernel correlation filtering
CN112348849A (en) Twin network video target tracking method and device
CN111932583A (en) Space-time information integrated intelligent tracking method based on complex background
CN110728698B (en) Multi-target tracking system based on composite cyclic neural network system
CN110660083A (en) Multi-target tracking method combined with video scene feature perception
CN111080675A (en) Target tracking method based on space-time constraint correlation filtering
CN109145836A (en) Ship target video detection method based on deep learning network and Kalman filtering
CN109886994A (en) Adaptive sheltering detection system and method in video tracking
CN108320306A (en) Merge the video target tracking method of TLD and KCF
CN111862145A (en) Target tracking method based on multi-scale pedestrian detection
CN110310305A (en) A kind of method for tracking target and device based on BSSD detection and Kalman filtering
CN114972439A (en) Novel target tracking algorithm for unmanned aerial vehicle
CN117576379B (en) Target detection method based on meta-learning combined attention mechanism network model
CN113536939B (en) Video duplication removing method based on 3D convolutional neural network
CN108320301B (en) Target tracking optimization method based on tracking learning detection
CN113129336A (en) End-to-end multi-vehicle tracking method, system and computer readable medium
CN110111358B (en) Target tracking method based on multilayer time sequence filtering
CN116245913A (en) Multi-target tracking method based on hierarchical context guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination