CN113128605A - Target tracking method based on particle filtering and depth distance measurement learning - Google Patents

Target tracking method based on particle filtering and depth distance measurement learning

Info

Publication number
CN113128605A
Authority
CN
China
Prior art keywords
target
tracking
automatic driving
sample
depth
Prior art date
Legal status
Pending
Application number
CN202110442516.1A
Other languages
Chinese (zh)
Inventor
王洪雁
张莉彬
袁海
张鼎卓
周贺
薛喜扬
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110442516.1A priority Critical patent/CN113128605A/en
Publication of CN113128605A publication Critical patent/CN113128605A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Abstract

The invention discloses a target tracking method based on particle filtering and depth distance measurement learning, relating to the field of visual tracking of automatic driving targets. The method comprises the following steps: constructing a nonlinear depth metric learning model; training the model on a given set of positive and negative automatic driving target samples and optimizing its parameters by gradient descent; constructing a particle-filter-based target observation model to obtain an optimal estimate of the automatic driving target state; and updating the target template through an online tracking strategy combining short-term updating with long-term stable updating, so as to track the automatic driving target effectively. The method performs well under partial occlusion, illumination change, and other challenging scenes; compared with the baseline algorithms, it achieves a lower average center error and a higher average overlap rate in most test scenes, and its overall tracking performance is superior.

Description

Target tracking method based on particle filtering and depth distance measurement learning
Technical Field
The invention relates to the field of automatic driving target visual tracking, in particular to an automatic driving target visual tracking method based on particle filtering and depth distance measurement learning.
Background
Automatic driving involves information perception, information processing, decision execution, and related fields. Information perception, the basic module that collects driving-environment information, relies on numerous sensors such as lidar, millimeter-wave radar, ultrasonic radar, GPS, and cameras. As a sensor that captures rich scene information at low cost, the camera has been widely adopted by industry as the scene-information sensing device for automatic driving. Camera-based automatic driving target tracking has therefore become one of the research hotspots in computer vision. In recent years, many efficient and robust visual tracking algorithms for automatic driving have been proposed, greatly advancing the practical application of visual target tracking. However, owing to the complexity of real driving scenes, the tracking process is subject to many interfering and uncertain factors, such as illumination change, scale change, and target occlusion, which significantly degrade tracking performance. How to improve the accuracy and robustness of visual tracking of automatic driving targets in complex scenes therefore remains one of the open difficulties in the visual tracking field.
To address the degradation of visual tracking performance in complex scenes, Nam H et al. proposed a deep-learning tracking method that first trains a network offline and then fine-tunes the network parameters to obtain a relatively good network model; however, it suffers from long training time and poorly targeted feature learning. Zhang K H et al. proposed a convolutional-network visual tracking algorithm (CNT) that first constructs a feature-map set with the K-means algorithm, then denoises the training result images with an adaptive threshold shrinkage algorithm, and finally builds the target model via sparse representation; however, its convolution operations reduce the resolution of the extracted feature maps. To mitigate these problems, Lu X K et al. proposed a regression network that maps samples to a labeled response map; since a dimensional mismatch can arise between target and background, the authors performed regression with a loss function incorporating shrinkage loss. Hu J et al. proposed learning a nonlinear distance metric with stacked independent subspace analysis networks to improve target-background discrimination, and Discriminative Deep Metric Learning (DDML) explicitly obtains a nonlinear distance metric by imposing a large-margin criterion on top of a trained deep network; however, because it requires a very large auxiliary data set that may be inconsistent with objects captured online, the learned features may not adapt to those objects.
Disclosure of Invention
Aiming at the problem of performance degradation of the traditional target tracking method in a complex environment, the invention provides a target tracking method based on particle filtering and depth distance measurement learning, which comprises the following steps:
constructing a nonlinear depth measurement learning model;
training the nonlinear depth measurement learning model based on a given set of positive and negative automatic driving target samples, and optimizing the parameters of the nonlinear depth measurement learning model based on a gradient descent method;
constructing a target observation model based on particle filtering to obtain an optimal estimation of the state of the automatic driving target;
and updating the target template through an online tracking strategy combining short-term and long-term stable updating, so as to effectively track the automatic driving target.
Owing to the adoption of the above technical scheme, the invention achieves the following technical effects. The proposed automatic driving target visual tracking method, which combines depth distance measurement learning with particle filtering, attains higher tracking precision and robustness when tracking targets in complex environments. The method first constructs a nonlinear depth metric learning model based on a deep network; it then optimizes the model parameters with a gradient descent algorithm; next, it constructs an observation model from the obtained optimal candidate-target predictions to estimate the automatic driving target state; finally, it updates the target template with a strategy combining short-term and long-term stable updating, achieving effective tracking of the automatic driving target. Qualitative analysis shows that the method performs well under partial occlusion, illumination change, and other challenging scenes; quantitative analysis shows that, compared with the baseline algorithms, the method attains a lower average center error and a higher average overlap rate in most test scenes, giving better overall tracking performance.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below cover only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a graph of the tracking results of five different tracking algorithms;
FIG. 3 is a graph of tracking success rate for different tracking methods;
FIG. 4 is a graph of overall accuracy for different tracking methods.
Detailed Description
The implementation steps of the present invention are described in further detail below with reference to the accompanying drawings and specific embodiments. Aiming at the significant degradation of target tracking performance caused by illumination change, target deformation, partial occlusion, and similar factors in complex environments, the invention provides an automatic driving target visual tracking method based on particle filtering and depth distance measurement learning. The method first learns layered nonlinear transformations in a feedforward neural network by deep metric learning; it then maps the template and the particles into the same feature space, where it minimizes the intra-class differences of positive training pairs while maximizing the inter-class differences of negative training pairs; next, within a particle filter framework, it identifies as the true target the candidate most similar to the template under the learned deep metric; finally, it updates the template online with a strategy combining short-term updating and long-term stable updating, reducing the influence of adverse factors and achieving effective tracking of the automatic driving target. Experimental results show that, compared with existing mainstream tracking algorithms, the proposed method attains higher tracking precision and better robustness in complex environments. The specific steps are as follows:
step 1, constructing a nonlinear depth measurement learning model, specifically:
constructing a deep network model by combining the samples
Figure BDA0003035527750000031
Passed into a multi-layer non-linear transformation to learn its non-linear representation. The number of network layers K is 1,2(k)Representing the dimension of the samples in the k-th layer network, the first layer samples are output as follows:
Figure BDA0003035527750000032
wherein, W(1)Projection matrix representing the first layer network, b(1)Is the amount of deviation of the first layer network. Phi is an S-shaped nonlinear activation function.
The output of the first layer network can be used as the input of the second layer network and recursion is performed in turn, and then the k-th layer network output can be expressed as:
Figure BDA0003035527750000033
wherein the content of the first and second substances,
Figure BDA0003035527750000034
sample(s)
Figure BDA0003035527750000035
The output of the top-most layer of the network can be expressed as:
Figure BDA0003035527750000036
wherein the mapping f is a parametric nonlinear function consisting of parameters
Figure BDA0003035527750000037
And
Figure BDA0003035527750000038
and (4) jointly determining.
Based on the method, the sample x is represented by the constructed depth network modeliAnd xjEuclidean distance between them to measure the similarity between them:
Figure BDA0003035527750000039
for learning parameters in the deep network model
Figure BDA00030355277500000310
And
Figure BDA00030355277500000311
based on the edge Fisher analysis criterion, the following nonlinear depth metric learning function is constructed:
Figure BDA00030355277500000312
wherein the content of the first and second substances,
Figure BDA0003035527750000041
represents a sample xiAnd xjBelongs to the technical field of the direct-alignment,
Figure BDA0003035527750000042
represents a sample xiAnd xjBelonging to a negative pair. Alpha is a positive parameter, the internal compactness of the positive sample pair and the separability of the negative sample pair to the sample can be balanced, and beta (beta is more than 0) is a regularization parameter.
Figure BDA0003035527750000043
The Frobenius norm of the matrix is represented, and M and N represent the total number of positive and negative pairs in the training data, respectively.
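As a concrete illustration of the layered mapping and the learned distance, the forward pass $h^{(k)} = \varphi(W^{(k)}h^{(k-1)} + b^{(k)})$ and the metric $d_f^2$ can be sketched as follows. This is an illustrative NumPy sketch, not code from the invention; the layer sizes, random initialization, and patch dimension are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: a 32x32 patch flattened to d = 1024, two layers (K = 2).
dims = [1024, 256, 64]                      # p^(0) = d, p^(1), p^(2)
W = [rng.normal(0.0, 0.01, (dims[k + 1], dims[k])) for k in range(len(dims) - 1)]
b = [np.zeros(dims[k + 1]) for k in range(len(dims) - 1)]

def sigmoid(z):
    """S-shaped nonlinear activation phi."""
    return 1.0 / (1.0 + np.exp(-z))

def f(x):
    """Topmost representation h^(K): recursively h^(k) = phi(W^(k) h^(k-1) + b^(k))."""
    h = x
    for Wk, bk in zip(W, b):
        h = sigmoid(Wk @ h + bk)
    return h

def dist2(xi, xj):
    """Squared Euclidean distance d_f^2(x_i, x_j) in the learned feature space."""
    diff = f(xi) - f(xj)
    return float(diff @ diff)
```

Identical samples map to distance zero and the distance is always non-negative; the metric learning objective then shapes this distance on positive and negative pairs.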
Step 2, training the nonlinear depth metric learning model based on a given set of positive and negative automatic driving target samples, and optimizing its parameters based on a gradient descent method, specifically:

Given a training sample set $x = (x_1, x_2, \dots, x_n)$, positive and negative samples are drawn according to a zero-mean Gaussian distribution around the target sample image patch. The six-parameter diagonal covariance matrix of the positive automatic driving samples is $\mathrm{diag}([1, 1, 0, 0, 0, 0])$, representing samples within a radius of two pixels around the target; the six-parameter diagonal covariance matrix of the negative automatic driving samples is $\mathrm{diag}([w, h, 0, 0, 0, 0])$, where $w$ and $h$ denote the width and height of the target sample respectively, so that negative samples are drawn far from the target object. As a result, some negative samples may contain both background and part of the target object itself. Based on the obtained training set, $M$ positive pairs and $N$ negative pairs are randomly formed to train the deep network.

Because the constructed nonlinear depth metric model is non-convex, a closed-form solution is difficult to obtain directly. To solve the optimization problem, the parameters $W^{(k)}$ and $b^{(k)}$ are solved by a gradient-descent-based method, specifically:

$$\frac{\partial L}{\partial W^{(k)}} = \sum_{i,j}\left(\delta_{ij}^{(k)}\left(h_i^{(k-1)}\right)^{\!\top} + \delta_{ji}^{(k)}\left(h_j^{(k-1)}\right)^{\!\top}\right) + 2\beta W^{(k)}$$

$$\frac{\partial L}{\partial b^{(k)}} = \sum_{i,j}\left(\delta_{ij}^{(k)} + \delta_{ji}^{(k)}\right) + 2\beta b^{(k)}$$

where $h_i^{(0)} = x_i$ denotes the original input sample. For a sample pair $(x_i, x_j)$, the error terms $\delta_{ij}^{(K)}$ and $\delta_{ji}^{(K)}$ at the topmost layer can be expressed as follows:

$$\delta_{ij}^{(K)} = c_{ij}\left(h_i^{(K)} - h_j^{(K)}\right)\odot\varphi'\!\left(z_i^{(K)}\right), \qquad \delta_{ji}^{(K)} = c_{ij}\left(h_j^{(K)} - h_i^{(K)}\right)\odot\varphi'\!\left(z_j^{(K)}\right)$$

where $c_{ij}$ is the pair weight determined by the objective ($c_{ij} = 2/M$ for a positive pair and $c_{ij} = -2\alpha/N$ for a negative pair) and $\varphi'(\cdot)$ denotes the derivative of the activation function. The error terms of the other layers, $\delta_{ij}^{(k)}$ and $\delta_{ji}^{(k)}$ for $k = K-1, \dots, 1$, are expressed as follows:

$$\delta_{ij}^{(k)} = \left(\left(W^{(k+1)}\right)^{\!\top}\delta_{ij}^{(k+1)}\right)\odot\varphi'\!\left(z_i^{(k)}\right), \qquad \delta_{ji}^{(k)} = \left(\left(W^{(k+1)}\right)^{\!\top}\delta_{ji}^{(k+1)}\right)\odot\varphi'\!\left(z_j^{(k)}\right)$$

where $\odot$ denotes element-wise multiplication and $z_i^{(k)}$ can be expressed as follows:

$$z_i^{(k)} = W^{(k)}h_i^{(k-1)} + b^{(k)}$$

The parameters $W^{(k)}$ and $b^{(k)}$, $k = 1, 2, \dots, K$, are then updated by the gradient descent algorithm until convergence:

$$W^{(k)} \leftarrow W^{(k)} - \eta\,\frac{\partial L}{\partial W^{(k)}}, \qquad b^{(k)} \leftarrow b^{(k)} - \eta\,\frac{\partial L}{\partial b^{(k)}}$$

where $\eta$ is the learning rate, which controls the convergence speed of the objective function $L$.
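For a single-layer network ($K = 1$) the objective and its analytic gradient can be sketched as follows. This is an illustrative reduction under stated assumptions (pair weights $c_{ij} = 2/M$ for positive pairs and $-2\alpha/N$ for negative pairs, sigmoid derivative $\varphi'(z) = \varphi(z)(1-\varphi(z))$, small illustrative dimensions and hyperparameter values), not the invention's multi-layer implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

d_in, d_out = 8, 4                     # illustrative input/feature dimensions
W = rng.normal(0.0, 0.1, (d_out, d_in))
bvec = np.zeros(d_out)
alpha, beta, eta = 0.5, 0.01, 1e-2     # illustrative hyperparameter values

def phi(z):
    return 1.0 / (1.0 + np.exp(-z))

def feat(x, W, bvec):
    return phi(W @ x + bvec)

def objective(pos, neg, W, bvec):
    """L = (1/M) sum_pos d^2 - (alpha/N) sum_neg d^2 + beta (||W||_F^2 + ||b||^2)."""
    dp = sum(np.sum((feat(xi, W, bvec) - feat(xj, W, bvec)) ** 2) for xi, xj in pos)
    dn = sum(np.sum((feat(xi, W, bvec) - feat(xj, W, bvec)) ** 2) for xi, xj in neg)
    return dp / len(pos) - alpha * dn / len(neg) + beta * (np.sum(W ** 2) + np.sum(bvec ** 2))

def gradients(pos, neg, W, bvec):
    """Analytic dL/dW and dL/db via the element-wise chain rule (single layer)."""
    gW, gb = 2.0 * beta * W, 2.0 * beta * bvec
    for pairs, c in ((pos, 1.0 / len(pos)), (neg, -alpha / len(neg))):
        for xi, xj in pairs:
            fi, fj = feat(xi, W, bvec), feat(xj, W, bvec)
            di = 2.0 * c * (fi - fj) * fi * (1.0 - fi)   # delta for x_i at top layer
            dj = 2.0 * c * (fj - fi) * fj * (1.0 - fj)   # delta for x_j at top layer
            gW += np.outer(di, xi) + np.outer(dj, xj)
            gb += di + dj
    return gW, gb
```

One step of the update rule above, W minus eta times dL/dW with a small learning rate, then lowers the objective on the sampled pairs.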
Step 3, constructing a target observation model based on the particle filter to obtain the optimal estimation of the automatic driving target state, specifically:

Suppose the automatic driving target state vector at time $r$ is $h_r = \{h_{rx}, h_{ry}, sc_r, \theta_r, \eta_r, \phi_r\}$, whose components are the six-degree-of-freedom affine transformation parameters (horizontal and vertical translation, scale, rotation angle, aspect ratio, and skew). The motion model of the target between adjacent frames can then be expressed as follows:

$$p(h_r \mid h_{r-1}) = \mathcal{N}\!\left(h_r;\; h_{r-1},\; \Sigma\right)$$

where $\mathcal{N}(h_r; h_{r-1}, \Sigma)$ denotes a Gaussian distribution over $h_r$ with mean $h_{r-1}$ and covariance $\Sigma$, and $\Sigma$ is a diagonal covariance matrix.

Since the candidate target updates its estimate only from the nearest neighboring frame, the motion model $p(h_r \mid h_{r-1})$ can be regarded as stationary, and the optimal candidate target can be selected directly according to the observation model $p(y_r^i \mid h_r^i)$, namely:

$$\hat{h}_r = \arg\max_{i}\; p(y_r^i \mid h_r^i), \qquad p(y_r^i \mid h_r^i) = \frac{1}{\Gamma}\exp\!\left(-\frac{d_f^2(y_r^i,\, t)}{\gamma}\right)$$

where $y_r^i$ is the observation of the $i$-th particle, $t$ is the target template, $\Gamma$ is a normalization factor, and $\gamma$ is a constant controlling the shape of the Gaussian kernel, set to 0.01 in the simulations.
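A minimal sketch of one particle-filter step under the motion model $p(h_r \mid h_{r-1}) = \mathcal{N}(h_r; h_{r-1}, \Sigma)$ and the exponential observation likelihood follows. Only $\gamma = 0.01$ is taken from the text; the state layout, the diagonal of $\Sigma$, and the assumption that each particle's image patch has already been mapped through the learned network $f$ are our own:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 6-D affine state: [x, y, scale, rotation, aspect ratio, skew].
sigma_diag = np.array([4.0, 4.0, 0.01, 0.005, 0.001, 0.001])  # diagonal of Sigma
gamma = 0.01   # Gaussian-kernel shape constant (0.01 in the simulations)

def propagate(particles):
    """Motion model: each particle does a Gaussian random walk, h_r ~ N(h_{r-1}, Sigma)."""
    return particles + rng.normal(0.0, sigma_diag, particles.shape)

def observe(features, template):
    """Observation likelihood p(y|h) proportional to exp(-d_f^2 / gamma), normalized."""
    d2 = np.sum((features - template) ** 2, axis=1)
    w = np.exp(-d2 / gamma)
    return w / w.sum()    # division plays the role of the normalization factor

def estimate(particles, weights):
    """Optimal candidate: the particle with the highest observation likelihood."""
    return particles[np.argmax(weights)]
```

With 600 particles per frame, as in the simulations below, the particle whose learned features are closest to the template receives the dominant weight.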
Step 4, updating the target template through an online tracking strategy combining short-term and long-term stable updating, so as to effectively track the automatic driving target, specifically:

In the actual tracking process, keeping the target template fixed cannot effectively track a changing target in a complex scene, so template updating has long been a central problem of online target tracking. If tracking relies on a template fixed from the first frame, the tracker cannot capture the target well under illumination change, background clutter, or partial occlusion; conversely, if the template is updated too quickly, each update introduces an error that gradually accumulates over time and causes the tracker to drift away from the target. To address these problems, the invention introduces an online tracking strategy combining short-term updating and long-term stable updating to update the target template.

Template initialization: first, the position of the target in the first frame is determined; then the tracking results of the previous $n$ frames are obtained with the tracking method described above and normalized; finally, they are combined into a template set $T = [t_1, t_2, \dots, t_n] \in \mathbb{R}^{b \times n}$.

Dynamic template updating: the similarities between the templates and the tracking result are expressed as $\psi = [\psi_1, \psi_2, \dots, \psi_n]$, with update threshold $\rho$. The similarity $\psi_i$ between the tracking result and the $i$-th template is expressed as:

$$\psi_i = \exp\!\left(-\left\|\hat{y}_r - t_i\right\|_2^2\right)$$

where $\hat{y}_r$ is the tracking result of the $r$-th frame; a larger similarity value $\psi_i$ indicates that the tracking result is more similar to the template.

Let the maximum similarity be $\Lambda$, expressed as:

$$\Lambda = \max_{i}\, \psi_i$$

The maximum similarity $\Lambda$ is compared with the threshold $\rho$: if $\Lambda > \rho$, the tracking result is most similar to the corresponding target template, and that template is updated; otherwise, no update is made. In the simulation experiments, the threshold is set to $\rho = 0.7$.
The effects of the present invention can be further illustrated by the following simulations:
simulation conditions are as follows: the hardware environment is as follows: intel Core (TM) i5-4258 CPU, dominant frequency 2.4GHz, memory 8GB, and experimental software test environment: python3.7, MATLAB 2017a, and open source deep learning framework Caffe. The simulation conditions were set as follows: the number of positive and negative samples extracted from the first frame by the tracking algorithm is respectively 100 and 400, and the number of positive and negative samples of each subsequent frame is respectively 30 and 120, so that 300 positive pairs and 900 negative pairs are generated. The tracking performance and the calculation complexity are balanced, if too many particles increase the calculation amount of the algorithm, and if too few particles cannot obtain the optimal target state, based on which the number of particles per frame is set to be 600, and the weight of the particles is initialized to be 1/600. The video tracking data set OTB-100 selected 6 video sequences of MotorRolling, Boy, Skating1, Bird2, Tiger2, Basketball, as test sets, which contained multiple tracking challenges. The CNN network used in the invention adopts a deep learning framework Caffe, the network weight value is updated by a gradient descent method, and a local area normalization parameter alpha is set to be 0.0001 and tau is set to be 0.75, so that the function of side inhibition is achieved, and the generalization capability of the network for extracting complex environment information is enhanced; the learning rate was set to 0.001 and the training period was 300 to minimize the occurrence of the "overfitting" phenomenon. The method adopts the average tracking overlapping rate and the average center position error to quantitatively analyze the tracking performance of the method.
Simulation content:
simulation 1: and (3) qualitative analysis: fig. 2 is a comparison of the results of 5 tracking algorithms for 6 test sequences. The MotorRolling sequence comprises the challenging factors of rapid motion, background clutter, illumination change and the like, the target in the 112 th frame is obviously changed from air falling, MIL and BACF have tracking drift or the phenomenon that a tracking frame is inconsistent with a real target, and the algorithm can always better track the target and can be attributed to the fact that the background clutter and the rapid motion influence are considered by the algorithm so as to accurately estimate the moving target. The target in the Basketball has obvious size change, the algorithm and the BCAF can locate the target and effectively track the target, and the method has a good tracking effect under the condition of size change. The target in Boy moves rapidly, meanwhile, factor interference such as scale change and rotation occurs, and the tracking drift phenomenon occurs in MIL after 418 frames. Skating1 belongs to a more complex scene, the target background contrast is lower, and there is a stronger illumination change. The target resolution ratio is low in the scene, and the template is timely updated by the algorithm through long and short time combination with an online updating strategy, so that stable tracking is realized. The algorithms proposed in Bird2 video sequences and Tiger2 video sequences can lock well to the target.
Simulation 2: quantitative analysis: as can be seen from Tables 1 and 2, on the 6 test sequences selected from OTB-100 the proposed algorithm achieves a better tracking effect than the comparison algorithms, which can be attributed to its use of depth distance metric learning and its introduction of error terms when constructing the likelihood model, reducing the sensitivity to backgrounds similar to the target. Compared with the comparison algorithms, the proposed algorithm performs well under occlusion, noise, and similar conditions, mainly because:
(1) the constructed model considers the correlation among candidate target templates, improving the tracking robustness of the algorithm in complex scenes;
(2) the depth distance metric measures the similarity of the particles, improving the tracking effectiveness;
(3) the combined long- and short-term updating strategy improves the robustness and tracking accuracy of the algorithm under noise and occlusion.
TABLE 1 average overlap ratio for different tracking methods
[table values not reproduced in the extracted text]
TABLE 2 average center position error for different tracking methods
[table values not reproduced in the extracted text]
The invention evaluates the overall performance of the trackers with success-rate curves and overall precision plots. The overall precision plot gives the percentage of frames whose center-position error falls within a distance threshold. The success-rate and overall precision curves of the compared algorithms are shown in FIG. 3 and FIG. 4, respectively. As can be seen from FIGS. 3 and 4, the tracking success rate of the proposed algorithm is higher than that of the comparison algorithms on most sequences; on the Tiger2 sequence its precision curve is slightly inferior to BACF, but its success-rate curve is still superior, and on the other sequences its overall precision also exceeds that of the comparison algorithms. The proposed algorithm therefore outperforms the comparison methods overall in complex scenes and is more robust.
Simulation 3: average running speed of the different tracking methods on each test sequence: to verify the timeliness of the proposed algorithm, its speed is measured in frames per second (FPS); each algorithm is run 50 times and the average FPS is used as the evaluation index. The FPS obtained by each algorithm on the different test sequences is listed in Table 3. As can be seen from Table 3, the proposed algorithm is faster than Struck, BACF, and DFT, and slower than MIL. However, as discussed above, its tracking performance on each test sequence is on the whole better than that of the comparison algorithms.
TABLE 3 average running speed (FPS) for different tracking methods under each test sequence
[table values not reproduced in the extracted text]
In summary, the invention provides an automatic driving target visual tracking method based on particle filtering and depth distance measurement learning. The method constructs a nonlinear depth distance metric learning model based on a deep network; it then constructs an observation model from the obtained optimal candidate-target predictions; finally, it updates the target template with a strategy combining short-term and long-term stable updating. Six test sequences containing occlusion, illumination change, and other factors were selected from the OTB-100 data set, and the effectiveness of the method was verified by comparison with four mainstream trackers: BACF, MIL, Struck, and DFT. Qualitative analysis shows that the method performs well under partial occlusion, illumination change, and other scenes; quantitative analysis shows that, compared with the comparison algorithms, it attains a lower average center error and a higher average overlap rate in most test scenes, with better overall tracking performance. The proposed algorithm can therefore provide a solid theoretical and engineering basis for visual tracking of automatic driving targets in complex driving scenes.
The embodiments of the present invention are illustrative and do not restrict the invention in any manner. The technical features or combinations of technical features described in the embodiments should not be considered in isolation; they may be combined with one another to achieve a better technical effect. The scope of the preferred embodiments of the present invention may also include additional implementations, as will be understood by those skilled in the art to which the embodiments pertain.

Claims (5)

1. The target tracking method based on particle filtering and depth distance measurement learning is characterized by comprising the following steps of:
constructing a nonlinear depth measurement learning model;
training the nonlinear depth measurement learning model based on a given set of positive and negative automatic driving target samples, and optimizing the parameters of the nonlinear depth measurement learning model based on a gradient descent method;
constructing a target observation model based on particle filtering to obtain an optimal estimation of the state of the automatic driving target;
updating the target template through an online tracking strategy combining short-term and long-term stable updating, so as to effectively track the automatic driving target.
2. The target tracking method based on particle filtering and depth distance metric learning of claim 1, wherein a nonlinear depth metric learning model is further constructed by using positive and negative samples of an automatic driving target obtained by nonlinear change of a depth network, specifically:
constructing a deep network model by combining the samples
Figure FDA0003035527740000011
Passing into a multi-layer non-linear transformation to learn its non-linear representation; the number of network layers K is 1,2(k)Representing the dimension of the samples in the k-th layer network, the first layer samples are output as follows:
Figure FDA0003035527740000012
wherein, W(1)Projection matrix representing the first layer network, b(1)Is the deviation amount of the first layer network; phi is an S-shaped nonlinear activation function;
the output result of the first layer network is used as the input of the second layer network and recurses in turn, and then the output of the k-th layer network is expressed as:
Figure FDA0003035527740000013
wherein the content of the first and second substances,
Figure FDA0003035527740000014
sample(s)
Figure FDA0003035527740000015
The output at the top of the network is represented as:
Figure FDA0003035527740000016
wherein the mapping f is a parametric nonlinear function consisting of parameters
Figure FDA0003035527740000017
And
Figure FDA0003035527740000018
jointly determining;
On this basis, the constructed deep network model measures the similarity between samples x_i and x_j by the Euclidean distance between their top-layer representations:

d²_f(x_i, x_j) = ||f(x_i) − f(x_j)||²₂

To learn the parameters W^(k) and b^(k) in the deep network model, the following nonlinear depth metric learning objective is constructed based on the marginal Fisher analysis criterion:

L = (1/M) Σ_{l_ij=1} d²_f(x_i, x_j) − (α/N) Σ_{l_ij=−1} d²_f(x_i, x_j) + β Σ_{k=1}^{K} ( ||W^(k)||²_F + ||b^(k)||²₂ )

wherein l_ij = 1 indicates that samples x_i and x_j belong to a positive pair, and l_ij = −1 indicates that they belong to a negative pair; α is a positive parameter balancing the internal compactness of the automatic driving target positive samples against the separability of the automatic driving target negative samples; β (β > 0) is a regularization parameter; ||·||_F represents the Frobenius norm of a matrix; and M and N represent the total numbers of positive and negative pairs in the training data, respectively.
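The objective above can be sketched numerically as follows. This is an illustrative sketch, not the patented implementation; the network shape, the pair data, and the default α and β values are assumptions of the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x, weights, biases):
    # Top-layer representation of the deep network
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)
    return h

def metric_loss(pos_pairs, neg_pairs, weights, biases, alpha=1.0, beta=0.01):
    """Marginal-Fisher-style objective: mean squared top-layer distance of
    positive pairs, minus alpha times that of negative pairs, plus beta
    times a Frobenius-norm regularizer on all layer parameters."""
    d2 = lambda xi, xj: np.sum((f(xi, weights, biases) - f(xj, weights, biases)) ** 2)
    compact = np.mean([d2(xi, xj) for xi, xj in pos_pairs])   # positive-pair compactness
    separate = np.mean([d2(xi, xj) for xi, xj in neg_pairs])  # negative-pair separability
    reg = sum(np.sum(W ** 2) + np.sum(b ** 2) for W, b in zip(weights, biases))
    return float(compact - alpha * separate + beta * reg)

rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)) * 0.1, rng.standard_normal((2, 3)) * 0.1]
biases = [np.zeros(3), np.zeros(2)]
pos = [(rng.standard_normal(4), rng.standard_normal(4)) for _ in range(5)]
neg = [(rng.standard_normal(4), rng.standard_normal(4)) for _ in range(5)]
val = metric_loss(pos, neg, weights, biases)
```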
3. The target tracking method based on particle filtering and depth distance metric learning of claim 1, wherein based on a given set of positive and negative samples of an automatic driving target, the nonlinear depth metric learning model is trained and its nonlinear deep learning model parameters are optimized based on a gradient descent method, specifically:
Given a training sample set X = (x_1, x_2, ..., x_n), positive and negative samples are drawn from a zero-mean Gaussian distribution around the target sample image block. The six-parameter diagonal covariance matrix of the automatic driving target positive samples is diag([1, 1, 0, 0, 0, 0]), representing samples within a radius of two pixels around the target; the six-parameter diagonal covariance matrix of the automatic driving target negative samples is diag([w, h, 0, 0, 0, 0]), where w and h respectively represent the width and height of the target sample, and sampling is performed from negative samples far away from the target object. Based on the obtained training set, M positive pairs and N negative pairs are randomly formed to train the parameters of the depth network;
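The sampling scheme above can be sketched as follows. This is an illustrative sketch only; the concrete target state, the sample counts, and the width/height values are assumptions of the example:

```python
import numpy as np

def sample_states(center, cov_diag, n, rng):
    """Draw n six-parameter affine states from a zero-mean Gaussian
    perturbation around `center`, with the given diagonal covariance."""
    return center + rng.standard_normal((n, 6)) * np.sqrt(cov_diag)

rng = np.random.default_rng(1)
# Hypothetical target state: [x, y, scale, rotation, aspect, skew]
target = np.array([100.0, 80.0, 1.0, 0.0, 1.0, 0.0])
w, h = 32.0, 48.0
# Positives: diag([1,1,0,0,0,0]) -> perturb position only, ~2 px radius
pos = sample_states(target, np.array([1, 1, 0, 0, 0, 0]), 50, rng)
# Negatives: diag([w,h,0,0,0,0]) -> wide spread, drawn far from the target
neg = sample_states(target, np.array([w, h, 0, 0, 0, 0]), 50, rng)
```

A practical implementation would also reject negative draws that land too close to the target; that filtering step is omitted here for brevity.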
The parameters W^(k) and b^(k) are solved using a gradient descent algorithm, specifically:

∂L/∂W^(k) = Σ_{i,j} ( δ_ij^(k) (h_i^(k−1))^T + δ_ji^(k) (h_j^(k−1))^T ) + 2βW^(k)

∂L/∂b^(k) = Σ_{i,j} ( δ_ij^(k) + δ_ji^(k) ) + 2βb^(k)

wherein h_i^(0) = x_i represents the original input sample, and (·)^T denotes the transpose of the vector; the gradient terms δ_ij^(k) and δ_ji^(k) between samples x_i and x_j at the top-most layer K are represented as follows:

δ_ij^(K) = c_ij (h_i^(K) − h_j^(K)) ⊙ φ′(z_i^(K))

δ_ji^(K) = c_ij (h_j^(K) − h_i^(K)) ⊙ φ′(z_j^(K))

wherein φ′(·) represents the derivative of the activation function φ, and c_ij = 2/M for positive pairs and c_ij = −2α/N for negative pairs, consistent with the objective function L; the variables for the other layers, k = 1, 2, ..., K − 1, are represented as follows:

δ_ij^(k) = ( (W^(k+1))^T δ_ij^(k+1) ) ⊙ φ′(z_i^(k))

δ_ji^(k) = ( (W^(k+1))^T δ_ji^(k+1) ) ⊙ φ′(z_j^(k))

wherein ⊙ indicates element-by-element multiplication, and z_i^(k) is:

z_i^(k) = W^(k) h_i^(k−1) + b^(k)
The parameters W^(k) and b^(k), k = 1, 2, ..., K, are updated based on the gradient descent algorithm until convergence:

W^(k) ← W^(k) − η ∂L/∂W^(k)

b^(k) ← b^(k) − η ∂L/∂b^(k)

wherein η is the learning rate, used to control the convergence speed of the objective function L.
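The update rule above can be demonstrated on a toy problem. This sketch is not the claimed method: it replaces the analytic layer-wise gradients with a finite-difference gradient (an assumption made so the example stays short), and the loss, data, and learning rate are illustrative:

```python
import numpy as np

def numeric_grad(loss_fn, param, eps=1e-5):
    """Central finite-difference gradient of loss_fn w.r.t. param,
    standing in for the analytic delta-rule gradients."""
    g = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        lp = loss_fn()
        param[idx] = old - eps
        lm = loss_fn()
        param[idx] = old  # restore
        g[idx] = (lp - lm) / (2 * eps)
    return g

# One gradient-descent step: W <- W - eta * dL/dW
eta = 0.1
W = np.array([[1.0, -0.5]])
loss = lambda: float(np.sum((W @ np.array([1.0, 2.0]) - 3.0) ** 2))
before = loss()
W -= eta * numeric_grad(loss, W)
after = loss()
```

For this quadratic loss a single step with η = 0.1 already reduces the objective, mirroring the iterate-until-convergence loop in the claim.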
4. The target tracking method based on particle filtering and depth distance metric learning of claim 1, wherein a target observation model is constructed based on particle filtering to obtain an optimal estimate of the automatic driving target state, specifically: suppose the automatic driving target state vector at time r is h_r = {h_rx, h_ry, sc_r, θ_r, ρ_r, φ_r}, where h_rx, h_ry, sc_r, θ_r, ρ_r, φ_r are the six-degree-of-freedom affine transformation parameters; the motion model of the driving target between adjacent frames is expressed as follows:

p(h_r | h_{r−1}) = N(h_r; h_{r−1}, Σ)

wherein N(h_r; h_{r−1}, Σ) indicates that h_r obeys a Gaussian distribution with mean h_{r−1} and variance Σ, Σ being a diagonal covariance matrix; with the motion model p(h_r | h_{r−1}) held fixed, the optimal candidate target is selected directly based on the observation model p(y_r | h_r), namely:

p(y_r | h_r) = (1/Γ) exp(−γ d²_f)

wherein y_r is the candidate observation corresponding to state h_r, d_f is the learned depth metric distance between the candidate and the target template, Γ is a normalization factor, and γ is a constant that controls the shape of the Gaussian kernel.
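One predict-and-weight step of the particle filter above can be sketched as follows. This is a toy illustration, not the patented tracker: the distance function, the particle count, the diagonal covariance values, and γ are all assumptions of the example:

```python
import numpy as np

def particle_filter_step(particles, sigma_diag, distance_to_template, gamma, rng):
    """Propagate particles with the Gaussian motion model h_r ~ N(h_{r-1}, Sigma),
    weight each by the Gaussian-kernel likelihood p(y|h) ∝ exp(-gamma * d^2),
    and return the maximum-likelihood candidate state."""
    particles = particles + rng.standard_normal(particles.shape) * np.sqrt(sigma_diag)
    d2 = np.array([distance_to_template(p) for p in particles])
    w = np.exp(-gamma * d2)
    w /= w.sum()  # normalization, the 1/Gamma factor
    return particles[np.argmax(w)], particles, w

rng = np.random.default_rng(2)
# 100 particles, all initialized at the previous six-parameter state
parts = np.tile(np.array([50.0, 40.0, 1.0, 0.0, 1.0, 0.0]), (100, 1))
sigma = np.array([4.0, 4.0, 0.01, 0.0, 0.0, 0.0])  # diagonal covariance
true_pos = np.array([52.0, 41.0])
dist = lambda p: float(np.sum((p[:2] - true_pos) ** 2))  # toy stand-in for d_f^2
best, parts, w = particle_filter_step(parts, sigma, dist, gamma=0.05, rng=rng)
```

In the patented method the toy distance would be replaced by the learned depth metric between the candidate image patch and the target template.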
5. The target tracking method based on particle filtering and depth distance metric learning according to claim 1, characterized in that the target template is updated by an online tracking strategy combining short-term and long-term stable updating, so as to realize effective tracking of the automatic driving target, specifically:
Firstly, the position of the driving target in the first frame is determined; then the tracking results of the previous n frames are obtained based on the proposed tracking method and normalized; finally, they are combined into a template set T = [t_1, t_2, ..., t_n] ∈ R^{b×n}.
The similarity between the templates and the tracking result is expressed as ψ = [ψ_1, ψ_2, ..., ψ_n], with threshold ρ; the similarity ψ_i between the tracking result and the i-th template is expressed as:

ψ_i = exp(−||ŷ_r − t_i||²₂)

wherein ŷ_r is the tracking result of the r-th frame; the larger the similarity value ψ_i, the more similar the tracking result is to the template.
Let the maximum similarity be Λ, expressed as:

Λ = max_i ψ_i

The maximum similarity Λ is compared with the threshold ρ; if Λ > ρ, the tracking result is maximally similar to one of the target templates, and the corresponding template is updated; otherwise, no update is carried out.
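The thresholded template-update rule above can be sketched as follows. This is an illustrative sketch; the Gaussian similarity form, the toy template set, and the replace-most-similar policy are assumptions of the example:

```python
import numpy as np

def update_templates(templates, result, rho):
    """Compute the similarity of the tracking result to each template
    column; if the maximum similarity exceeds rho, replace the most
    similar template with the result. Returns whether an update occurred."""
    # psi_i = exp(-||result - t_i||^2), one value per template column
    psi = np.exp(-np.sum((templates - result[:, None]) ** 2, axis=0))
    lam = psi.max()  # maximum similarity Lambda
    if lam > rho:
        templates[:, psi.argmax()] = result
        return True
    return False

T = np.eye(4)[:, :3]  # toy template set: b = 4 features, n = 3 templates
res = np.array([0.9, 0.1, 0.0, 0.0])  # toy normalized tracking result
updated = update_templates(T, res, rho=0.3)
```

Here the result is close to the first template, so Λ exceeds ρ and that template is replaced; with a dissimilar result the set would be left unchanged.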
CN202110442516.1A 2021-04-23 2021-04-23 Target tracking method based on particle filtering and depth distance measurement learning Pending CN113128605A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110442516.1A CN113128605A (en) 2021-04-23 2021-04-23 Target tracking method based on particle filtering and depth distance measurement learning


Publications (1)

Publication Number Publication Date
CN113128605A true CN113128605A (en) 2021-07-16

Family

ID=76779390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110442516.1A Pending CN113128605A (en) 2021-04-23 2021-04-23 Target tracking method based on particle filtering and depth distance measurement learning

Country Status (1)

Country Link
CN (1) CN113128605A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110082106A (en) * 2019-04-17 2019-08-02 武汉科技大学 A kind of Method for Bearing Fault Diagnosis of the depth measure study based on Yu norm
CN111160119A (en) * 2019-12-11 2020-05-15 常州工业职业技术学院 Multi-task depth discrimination metric learning model construction method for cosmetic face verification
CN112085765A (en) * 2020-09-15 2020-12-15 浙江理工大学 Video target tracking method combining particle filtering and metric learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination