CN111008996B - Target tracking method through hierarchical feature response fusion - Google Patents


Info

Publication number
CN111008996B
CN111008996B (application CN201911250349.XA; also published as CN111008996A)
Authority
CN
China
Prior art keywords
scale
model
tracking
target
response value
Prior art date
Legal status
Active
Application number
CN201911250349.XA
Other languages
Chinese (zh)
Other versions
CN111008996A (en)
Inventor
柳培忠
邓建华
张万程
杜永兆
陈智
吴奕红
杨建兰
Current Assignee
Quanzhou Zhongfang Hongye Information Technology Co ltd
Huaqiao University
Original Assignee
Quanzhou Zhongfang Hongye Information Technology Co ltd
Huaqiao University
Priority date
Filing date
Publication date
Application filed by Quanzhou Zhongfang Hongye Information Technology Co., Ltd. and Huaqiao University
Priority to CN201911250349.XA
Publication of CN111008996A
Application granted
Publication of CN111008996B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G06T 7/246: Physics; Computing, calculating or counting; Image data processing or generation, in general; Image analysis; Analysis of motion; Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F 18/25: Physics; Computing, calculating or counting; Electric digital data processing; Pattern recognition; Analysing; Fusion techniques
    • G06T 2207/10016: Indexing scheme for image analysis or image enhancement; Image acquisition modality; Video, image sequence
    • G06T 2207/20004: Indexing scheme for image analysis or image enhancement; Special algorithmic details; Adaptive image processing
    • G06T 2207/20081: Indexing scheme for image analysis or image enhancement; Special algorithmic details; Training, learning
    • G06T 2207/20084: Indexing scheme for image analysis or image enhancement; Special algorithmic details; Artificial neural networks [ANN]
    • Y02T 10/40: General tagging of cross-sectional technologies; Technologies or applications for mitigation or adaptation against climate change; Climate change mitigation technologies related to transportation; Road transport of goods or passengers; Internal combustion engine [ICE] based vehicles; Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method through hierarchical feature response fusion, and relates to the field of computer vision target tracking. The method comprises the following steps: step 10, initializing parameters; step 20, extracting the hierarchical features of the target image and performing response value fusion to obtain a position model; step 30, training on the maximum scale response value of a scale correlation filter to obtain a scale model; step 40, when the fused response value obtained from the response fusion of step 20 is less than or equal to a set threshold, re-detecting the target image to obtain candidate regions and returning to step 20, and when the fused response value is greater than the set threshold, updating the position model and the scale model and then entering step 50; and step 50, using the updated position model and the updated scale model to track the next frame, and returning to step 40. By changing the conditions for the adaptive fusion of hierarchical features and for model updating, the method improves the tracking accuracy of the correlation filter and makes the tracking effect more satisfactory.

Description

Target tracking method through hierarchical feature response fusion
Technical Field
The invention relates to the field of computer vision target tracking, in particular to a target tracking method through hierarchical feature response fusion.
Background
Visual target tracking is a basic task in the field of computer vision, with very wide applications in automatic driving, robotics, video surveillance, human-computer interaction and the like. Even though the initial frame of the target is given, predicting the position of the target in the subsequent video frames with an efficient method remains a challenge. Despite the great advances made in recent years, tracking is still challenged by many external factors. For example, the target typically experiences disturbances such as occlusion, background blur, fast motion, illumination changes, deformation, scale changes and leaving the field of view, all of which affect the accuracy and robustness of target tracking.
Currently, the two major directions of target tracking are correlation-filter-based methods and deep-learning-based methods. Correlation filter methods benefit from fast operations and therefore track very quickly, but their tracking accuracy is too low, so the tracking effect is not ideal. Deep learning methods can generate rich target image features and distinguish the target well during tracking, which greatly improves the tracking accuracy, but because of feature extraction and other processes the computational load is huge and the tracking speed cannot reach real time. Although the prior art achieves the expected tracking results, various interferences are inevitably encountered during tracking; these introduce erroneous background information that is passed on to the next frame, and its long-term accumulation degrades the quality of the tracking model, eventually causing tracking-target drift or tracking failure. It is therefore important to decide when to update the position model and the scale model.
Based on the above, the present inventors have further explored and studied it, and have proposed a target tracking method by hierarchical feature response fusion.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a target tracking method through hierarchical feature response fusion which, by changing the conditions for the adaptive fusion of hierarchical features and for model updating, solves the problems of low tracking accuracy and unsatisfactory tracking effect of correlation filters.
The technical problem to be solved by the invention is realized as follows:
a method of target tracking through hierarchical feature response fusion, comprising:
step 10, initializing parameters;
step 20, extracting the layered features of the target image, and performing response value fusion to obtain a position model;
step 30, training the maximum scale response value of the scale correlation filter to obtain a scale model;
step 40, when the fusion response value obtained after the response value fusion in the step 20 is less than or equal to the set threshold, re-detecting the target image to obtain a candidate region, and returning to the step 20; when the fusion response value is larger than a set threshold value, updating the position model and the scale model, and then entering step 50;
and step 50, using the updated position model and the updated scale model for tracking the next frame, and returning to the step 40.
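To make the control flow of steps 10-50 concrete, the following Python sketch wires the five steps into a single loop. It is only a skeleton under stated assumptions: locate, estimate_scale and redetect are hypothetical callables standing in for steps 20, 30 and 402 (they are not names from the patent) and are passed in as parameters, and params["delta"] is the fixed threshold of step 40.

```python
# Skeleton of the steps 10-50 loop; the three callables are hypothetical stand-ins, not patent APIs.
def track(frames, init_box, params, locate, estimate_scale, redetect):
    pos_model, scale_model, box = None, None, init_box
    results = []
    for frame in frames:
        # step 20: hierarchical features + response fusion -> position, fused response, candidate model
        box, fused_peak, pos_new = locate(frame, box, pos_model, params)
        # step 30: scale correlation filter -> scale estimate and candidate scale model
        box, scale_new = estimate_scale(frame, box, scale_model, params)
        if fused_peak <= params["delta"]:
            # step 40 (low confidence): re-detect candidate regions and redo step 20
            box = redetect(frame, box, pos_model, params)
            box, fused_peak, pos_new = locate(frame, box, pos_model, params)
        else:
            # step 40 (high confidence): accept the model updates
            pos_model, scale_model = pos_new, scale_new
        results.append(box)  # step 50: the updated models are used for the next frame
    return results
```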
Further, in step 10, the parameters include: the hierarchical correlation filters {W_t^l | l = 3, 4, 5}, the filter regularization term weight factors λ and λ1, the tracking model learning factor η, the number of scale levels S, the scale increment factor θ, the weight update parameter τ, the fixed threshold Δ and the weight factor α.
Further, the step 20 includes:
step 201, at frame t of the video sequence, taking a region of set size centered on the target point P(x_t, y_t) as the target sample, feeding the target sample into a convolutional neural network, and extracting the layer-3, layer-4 and layer-5 convolution features to obtain the layer-3, layer-4 and layer-5 feature images;
step 202, learning on layers 3, 4 and 5 of the convolutional neural network respectively to obtain three initial context-aware correlation filters;
circularly shifting the layer-3, layer-4 and layer-5 feature images to form training samples, obtaining a data matrix and a desired output, and using them to optimize the initial context-aware filter w' as in formula (1) to obtain the optimized context-aware filter:
[Formula (1), shown only as an image in the original: the regularized least-squares objective, with regularization weight λ1, that refines the initial filter w' into the optimized filter w from the data matrix U_0 and the desired output y]
wherein w is the optimized context-aware filter, w' is the initial context-aware filter, λ1 is the regularization term weight factor, U_0 is the data matrix, and y is the desired output;
step 203, convolving the three optimized context-aware filters with the layer-3, layer-4 and layer-5 feature images, obtaining the response vector of each feature image by using formula (2), and then searching for the position corresponding to the maximum of the response vector, namely the predicted position of the tracking target;
$f(z) = \mathcal{F}^{-1}\left(\hat{w} \odot \hat{z}\right)$   (2)
wherein z is the feature image, w is the optimized context-aware filter, a hat denotes the Fourier transform, $\mathcal{F}^{-1}$ is the inverse Fourier transform, ⊙ is the element-wise (dot) product between matrix elements, and f(z) is the response vector of the feature image;
step 204, updating the position parameters by a linear interpolation method so as to update the position model, the position parameter updates being given by formulas (3a) and (3b):
$\hat{\alpha}^{\,i} = (1-\eta)\,\hat{\alpha}^{\,i-1} + \eta\,\hat{\alpha}$   (3a)
$\hat{x}^{\,i} = (1-\eta)\,\hat{x}^{\,i-1} + \eta\,\hat{x}$   (3b)
wherein i is the sequence number of the current frame, η is the tracking model learning factor, $\hat{\alpha}$ is the closed-form solution of the training sample parameters in the Fourier domain obtained using the properties of the circulant matrix, and $\hat{x}$ is the updated position model of the target sample;
step 205, recording the three output response values obtained by convolving the layer-3, layer-4 and layer-5 feature images with their optimized context-aware filters as R_context3, R_context4 and R_context5, and then normalizing the response value weight of each layer at frame t as in formulas (4a), (4b), (4c):
$\mathrm{context3\_w}_{t} = \dfrac{R_{\mathrm{context3}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4a)
$\mathrm{context4\_w}_{t} = \dfrac{R_{\mathrm{context4}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4b)
$\mathrm{context5\_w}_{t} = \dfrac{R_{\mathrm{context5}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4c)
updating the initial response value weights by using the frame-t response value weights as in formulas (5a), (5b), (5c):
$\mathrm{context3\_w}_{t} = (1-\tau)\,\mathrm{context3\_w}_{t-1} + \tau\,\mathrm{context3\_w}_{t}$   (5a)
$\mathrm{context4\_w}_{t} = (1-\tau)\,\mathrm{context4\_w}_{t-1} + \tau\,\mathrm{context4\_w}_{t}$   (5b)
$\mathrm{context5\_w}_{t} = (1-\tau)\,\mathrm{context5\_w}_{t-1} + \tau\,\mathrm{context5\_w}_{t}$   (5c)
wherein τ is the weight update parameter, and context3_w_t, context4_w_t and context5_w_t denote the response value weights at frame t;
step 206, at frame t, fusing the response values of the layer-3, layer-4 and layer-5 feature images through formula (6) to obtain the fused output response value R_t, obtaining the final position model, and obtaining the position of the tracking target according to the final position model;
$R_{t} = \mathrm{context3\_w}_{t} \cdot R_{\mathrm{context3}} + \mathrm{context4\_w}_{t} \cdot R_{\mathrm{context4}} + \mathrm{context5\_w}_{t} \cdot R_{\mathrm{context5}}$   (6)
wherein context3_w_t, context4_w_t and context5_w_t denote the response value weights at frame t, and R_context3, R_context4 and R_context5 are the output response values obtained by convolving the layer-3, layer-4 and layer-5 feature images with their optimized context-aware filters.
Further, the step 30 includes:
step 301, setting the sizes of the image patches extracted for scale evaluation around the target to
$\theta^{n} P \times \theta^{n} R, \quad n \in \left\{ -\left\lfloor \tfrac{S-1}{2} \right\rfloor, \dots, \left\lfloor \tfrac{S-1}{2} \right\rfloor \right\}$   (7)
wherein P and R are the width and height of the target sample in the previous frame, θ is the scale increment factor, and S is the number of scale levels;
step 302, minimizing the cost function in formula (8) to obtain the scale correlation filter:
$\varepsilon = \left\| g - \sum_{l=1}^{d} h^{l} \star f^{l} \right\|^{2} + \lambda \sum_{l=1}^{d} \left\| h^{l} \right\|^{2}$   (8)
wherein ε is the cost function, h is the scale correlation filter, g is the ideal correlation output, l denotes the feature dimension, λ is the regularization term weight factor, f is the response vector of the feature image, and d is the number of feature dimensions;
step 303, solving formula (8) in the frequency domain, as in formula (9), for estimating the target scale:
$H^{l} = \dfrac{\bar{G}\,F^{l}}{\sum_{k=1}^{d} \bar{F}^{k} F^{k} + \lambda}$   (9)
wherein H is the scale correlation filter in the frequency domain, l denotes the feature dimension, H^l is the scale correlation filter in dimension l, F^k is the k-th training sample, F^l is the training sample of the l-th dimension, G is the ideal correlation output, $\bar{G}$ is the complex conjugate of the ideal correlation output, $\bar{F}^{k}$ is the complex conjugate of the k-th training sample, λ is the regularization term weight factor, t is the frame number, and d and k index the feature dimensions;
step 304, to obtain a robust result, updating the numerator and the denominator of H^l in formula (9) separately so as to update the scale model:
$D_{t}^{l} = (1-\eta)\,D_{t-1}^{l} + \eta\,\bar{G}_{t} F_{t}^{l}$   (10a)
$E_{t} = (1-\eta)\,E_{t-1} + \eta \sum_{k=1}^{d} \bar{F}_{t}^{k} F_{t}^{k}$   (10b)
wherein η is the tracking model learning factor, $F_{t}^{k}$ is the k-th training sample, $\bar{F}_{t}^{k}$ is the complex conjugate of the k-th training sample, $G_{t}$ is the ideal correlation output, $F_{t}^{l}$ is the training sample of the l-th dimension, λ is the regularization term weight factor, t is the frame number, l is the dimension, and d and k index the feature dimensions; updating the scale model means updating $D_{t}^{l}$ and $E_{t}$;
step 305, in the next frame, the response value of the scale correlation filter can be determined by solving formula (11):
$y = \mathcal{F}^{-1}\left\{ \dfrac{\sum_{l=1}^{d} \bar{D}_{t}^{l} Z^{l}}{E_{t} + \lambda} \right\}$   (11)
wherein Z is the set of feature images z.
Further, the step 40 specifically includes:
step 401, judging whether the fused output response value R_t obtained in step 20 is less than or equal to the fixed threshold Δ; if it is, executing step 402, and if not, executing step 403;
step 402, generating a number of candidate regions C_d over the whole image through the EdgeBox re-detector, then calculating the confidence value of each candidate region, learning the optimized context-aware filter using a learning rate, obtaining the maximum response value g(c), then selecting the optimal candidate region as the re-detection result by minimizing formula (12), and then returning to step 20,
[Formula (12), shown only as an image in the original: the candidate score combining the maximum response value g(c) of each candidate $c_{t}^{i}$ with the weighted center-distance term α·D; the candidate minimizing this score is selected]
wherein g(c) denotes the maximum response value, α denotes the weight factor, $c_{t}^{i}$ denotes a candidate region in frame t, and D denotes the distance between the center of each $c_{t}^{i}$ and the center of the bounding box $c_{t-1}$;
step 403, updating the position model by using formulas (3a) and (3b), updating the scale model by using formulas (10a) and (10b), and then proceeding to step 50.
The invention has the following advantages:
1. A convolutional neural network is used to extract image features, context-aware correlation filters output the corresponding response values, and the three response values are adaptively fused, so that the position model can predict the position of the target object well;
2. A scale correlation filter is used to perform fast scale estimation of the target, which improves the ability to handle scale changes to a certain extent and improves the tracking accuracy;
3. An EdgeBox re-detector is incorporated: when tracking fails or drifts, the re-detector re-detects the target image, and the position model and the scale model are updated only when the conditions are met.
Drawings
The invention will be further described with reference to the following examples and figures.
FIG. 1 is a flowchart of an implementation of the target tracking method through hierarchical feature response fusion according to an embodiment of the present specification;
FIG. 2 is a schematic flowchart of the target tracking method through hierarchical feature response fusion according to an embodiment of the present specification;
FIG. 3 is the tracking accuracy plot of the embodiment of the present specification on the 100 video sequences of the OTB-2015 data set;
FIG. 4 is the success rate plot of the embodiment of the present specification on the 100 video sequences of the OTB-2015 data set;
FIG. 5 is the tracking accuracy plot for the illumination variation attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 6 is the success rate plot for the illumination variation attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 7 is the tracking accuracy plot for the scale variation attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 8 is the success rate plot for the scale variation attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 9 is the tracking accuracy plot for the in-plane rotation attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 10 is the success rate plot for the in-plane rotation attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 11 is the tracking accuracy plot for the background blur attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 12 is the success rate plot for the background blur attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 13 is the tracking accuracy plot for the occlusion attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 14 is the success rate plot for the occlusion attribute on the 100 video sequences of the OTB-2015 data set.
Detailed Description
Referring to fig. 1 and fig. 2, a target tracking method through hierarchical feature response fusion provided in an embodiment of the present specification may include the following steps:
step 10, initializing parameters, wherein the parameters include: the hierarchical correlation filters {W_t^l | l = 3, 4, 5}, the filter regularization term weight factors λ and λ1, the tracking model learning factor η, the number of scale levels S, the scale increment factor θ, the weight update parameter τ, the fixed threshold Δ and the weight factor α;
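As a concrete illustration of step 10, the parameters can be gathered into a single configuration object. The sketch below is a minimal Python example; the numeric values are illustrative assumptions and are not specified by the patent.

```python
# Illustrative parameter set for step 10; the numeric values are assumptions, not from the patent.
params = {
    "layers": [3, 4, 5],   # CNN layers whose features drive the hierarchical filters W_t^l
    "lambda_": 1e-4,       # filter regularization weight lambda
    "lambda1": 1e-4,       # context regularization weight lambda1
    "eta": 0.01,           # tracking model learning factor eta
    "S": 33,               # number of scale levels
    "theta": 1.02,         # scale increment factor theta
    "tau": 0.1,            # weight update parameter tau
    "delta": 0.2,          # fixed threshold for triggering re-detection
    "alpha": 0.1,          # weight factor alpha in the re-detection score
}
```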
step 20, extracting the hierarchical characteristics of the target image, and performing response value fusion to obtain a position model; the method specifically comprises the following steps:
step 201, at frame t of the video sequence, taking a region of set size centered on the target point P(x_t, y_t) as the target sample, feeding the target sample into a convolutional neural network, and extracting the layer-3, layer-4 and layer-5 convolution features to obtain the layer-3, layer-4 and layer-5 feature images;
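The patent does not name the particular convolutional network; a common choice for hierarchical-feature trackers is VGG-19, whose conv3, conv4 and conv5 blocks supply the layer-3, layer-4 and layer-5 feature images. The sketch below, written under that assumption with PyTorch/torchvision (version 0.13 or later assumed for the weights argument), extracts the three feature maps from a target patch.

```python
# Sketch: extract layer-3/4/5 feature images from a target patch (assumes a VGG-19 backbone).
import torch
import torchvision.models as models
import torchvision.transforms as T

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
LAYER_IDS = {17: 3, 26: 4, 35: 5}   # torchvision index of the last ReLU in conv blocks 3, 4, 5

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_hierarchical_features(patch_rgb):
    """Return {3: fmap3, 4: fmap4, 5: fmap5} for an HxWx3 uint8 target patch."""
    x = preprocess(patch_rgb).unsqueeze(0)
    feats = {}
    with torch.no_grad():
        for idx, layer in enumerate(vgg):
            x = layer(x)
            if idx in LAYER_IDS:
                feats[LAYER_IDS[idx]] = x.squeeze(0)   # C x H x W feature image
            if len(feats) == len(LAYER_IDS):
                break
    return feats
```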
step 202, learning on layers 3, 4 and 5 of the convolutional neural network respectively to obtain three initial context-aware correlation filters;
circularly shifting the layer-3, layer-4 and layer-5 feature images to form training samples, obtaining a data matrix and a desired output, and using them to optimize the initial context-aware filter w' as in formula (1) to obtain the optimized context-aware filter:
[Formula (1), shown only as an image in the original: the regularized least-squares objective, with regularization weight λ1, that refines the initial filter w' into the optimized filter w from the data matrix U_0 and the desired output y]
wherein w is the optimized context-aware filter, w' is the initial context-aware filter, λ1 is the regularization term weight factor, U_0 is the data matrix, and y is the desired output;
the resulting optimized context-aware filter has a high response to the target image patch and a near-zero response to the context image patches;
step 203, convolving the three optimized context-aware filters with the layer-3, layer-4 and layer-5 feature images, obtaining the response vector of each feature image by using formula (2), and then searching for the position corresponding to the maximum of the response vector, namely the predicted position of the tracking target;
$f(z) = \mathcal{F}^{-1}\left(\hat{w} \odot \hat{z}\right)$   (2)
wherein z is the feature image, w is the optimized context-aware filter, a hat denotes the Fourier transform, $\mathcal{F}^{-1}$ is the inverse Fourier transform, ⊙ is the element-wise (dot) product between matrix elements, and f(z) is the response vector of the feature image;
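A minimal numpy sketch of formula (2) as reconstructed above: the filter is applied in the Fourier domain, the element-wise product is transformed back to the spatial domain, and the peak of the response map gives the predicted target position. The filter w_hat is assumed to be given already in the Fourier domain, and the multi-channel summation over feature channels is omitted for brevity.

```python
# Sketch of formula (2): response map and peak location for one feature channel.
import numpy as np

def correlation_response(z, w_hat):
    """z: HxW feature image (spatial domain); w_hat: HxW filter in the Fourier domain."""
    z_hat = np.fft.fft2(z)
    return np.real(np.fft.ifft2(w_hat * z_hat))   # F^{-1}(w_hat elementwise z_hat)

def predicted_position(response):
    """Row/column index of the maximum of the response map, i.e. the predicted target position."""
    return np.unravel_index(np.argmax(response), response.shape)
```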
step 204, updating the position parameters by a linear interpolation method so as to update the position model, the position parameter updates being given by formulas (3a) and (3b):
$\hat{\alpha}^{\,i} = (1-\eta)\,\hat{\alpha}^{\,i-1} + \eta\,\hat{\alpha}$   (3a)
$\hat{x}^{\,i} = (1-\eta)\,\hat{x}^{\,i-1} + \eta\,\hat{x}$   (3b)
wherein i is the sequence number of the current frame, η is the tracking model learning factor, $\hat{\alpha}$ is the closed-form solution of the training sample parameters in the Fourier domain obtained using the properties of the circulant matrix, and $\hat{x}$ is the updated position model of the target sample; updating the position model means updating $\hat{\alpha}$ and $\hat{x}$;
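Formulas (3a) and (3b) are plain linear interpolation with learning factor η, applied to both the Fourier-domain filter parameters and the target appearance model. A one-line numpy helper is sketched below; the variable names follow the reconstruction above and are assumptions rather than names from the patent.

```python
# Sketch of the linear-interpolation update of formulas (3a)/(3b).
import numpy as np

def interp_update(old, new, eta):
    """(1 - eta) * old + eta * new, used for both alpha_hat (3a) and x_hat (3b)."""
    return (1.0 - eta) * old + eta * new

# usage: alpha_hat_model = interp_update(alpha_hat_model, alpha_hat_frame, eta)
#        x_hat_model     = interp_update(x_hat_model,     x_hat_frame,     eta)
```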
step 205, recording the three output response values obtained by convolving the layer-3, layer-4 and layer-5 feature images with their optimized context-aware filters as R_context3, R_context4 and R_context5, and then normalizing the response value weight of each layer at frame t as in formulas (4a), (4b), (4c):
$\mathrm{context3\_w}_{t} = \dfrac{R_{\mathrm{context3}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4a)
$\mathrm{context4\_w}_{t} = \dfrac{R_{\mathrm{context4}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4b)
$\mathrm{context5\_w}_{t} = \dfrac{R_{\mathrm{context5}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4c)
a larger filter response value thus takes a larger proportion and is assigned a higher weight; the initial response value weights are updated by using the frame-t response value weights as in formulas (5a), (5b), (5c):
$\mathrm{context3\_w}_{t} = (1-\tau)\,\mathrm{context3\_w}_{t-1} + \tau\,\mathrm{context3\_w}_{t}$   (5a)
$\mathrm{context4\_w}_{t} = (1-\tau)\,\mathrm{context4\_w}_{t-1} + \tau\,\mathrm{context4\_w}_{t}$   (5b)
$\mathrm{context5\_w}_{t} = (1-\tau)\,\mathrm{context5\_w}_{t-1} + \tau\,\mathrm{context5\_w}_{t}$   (5c)
wherein τ is the weight update parameter, and context3_w_t, context4_w_t and context5_w_t denote the response value weights at frame t;
step 206, at frame t, fusing the response values of the layer-3, layer-4 and layer-5 feature images through formula (6) to obtain the fused output response value R_t, obtaining the final position model, and obtaining the position of the tracking target according to the final position model;
$R_{t} = \mathrm{context3\_w}_{t} \cdot R_{\mathrm{context3}} + \mathrm{context4\_w}_{t} \cdot R_{\mathrm{context4}} + \mathrm{context5\_w}_{t} \cdot R_{\mathrm{context5}}$   (6)
wherein context3_w_t, context4_w_t and context5_w_t denote the response value weights at frame t, and R_context3, R_context4 and R_context5 are the output response values obtained by convolving the layer-3, layer-4 and layer-5 feature images with their optimized context-aware filters.
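The weight normalization of formulas (4a)-(4c), the τ-smoothed weight update of formulas (5a)-(5c) and the weighted fusion of formula (6), as reconstructed above, amount to the following numpy sketch. It assumes each layer's scalar confidence is taken as the peak of its response map and that the three response maps have been resized to a common size; both are assumptions, not details fixed by the patent text.

```python
# Sketch of steps 205-206: adaptive fusion of the layer-3/4/5 response maps.
import numpy as np

def fuse_responses(resp_maps, prev_weights, tau):
    """resp_maps: {3: R3, 4: R4, 5: R5} (same-sized arrays); prev_weights: previous-frame weights."""
    peaks = {l: float(np.max(r)) for l, r in resp_maps.items()}
    total = sum(peaks.values())
    norm_w = {l: p / total for l, p in peaks.items()}              # formulas (4a)-(4c)
    weights = {l: (1.0 - tau) * prev_weights[l] + tau * norm_w[l]  # formulas (5a)-(5c)
               for l in resp_maps}
    fused = sum(weights[l] * resp_maps[l] for l in resp_maps)      # formula (6)
    return fused, weights
```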
Step 30, training the maximum scale response value of the scale correlation filter to obtain a scale model; the method specifically comprises the following steps:
step 301, setting the sizes of the image patches extracted for scale evaluation around the target to
$\theta^{n} P \times \theta^{n} R, \quad n \in \left\{ -\left\lfloor \tfrac{S-1}{2} \right\rfloor, \dots, \left\lfloor \tfrac{S-1}{2} \right\rfloor \right\}$   (7)
wherein P and R are the width and height of the target sample in the previous frame, θ is the scale increment factor, and S is the number of scale levels;
step 302, minimizing the cost function in formula (8) to obtain the scale correlation filter:
$\varepsilon = \left\| g - \sum_{l=1}^{d} h^{l} \star f^{l} \right\|^{2} + \lambda \sum_{l=1}^{d} \left\| h^{l} \right\|^{2}$   (8)
wherein ε is the cost function, h is the scale correlation filter, g is the ideal correlation output, l denotes the feature dimension, λ is the regularization term weight factor, f is the response vector of the feature image, and d is the number of feature dimensions;
step 303, solving formula (8) in the frequency domain, as in formula (9), for estimating the target scale:
$H^{l} = \dfrac{\bar{G}\,F^{l}}{\sum_{k=1}^{d} \bar{F}^{k} F^{k} + \lambda}$   (9)
wherein H is the scale correlation filter in the frequency domain, l denotes the feature dimension, H^l is the scale correlation filter in dimension l, F^k is the k-th training sample, F^l is the training sample of the l-th dimension, G is the ideal correlation output, $\bar{G}$ is the complex conjugate of the ideal correlation output, $\bar{F}^{k}$ is the complex conjugate of the k-th training sample, λ is the regularization term weight factor, t is the frame number, and d and k index the feature dimensions;
step 304, to obtain a robust result, updating the numerator and the denominator of H^l in formula (9) separately so as to update the scale model:
$D_{t}^{l} = (1-\eta)\,D_{t-1}^{l} + \eta\,\bar{G}_{t} F_{t}^{l}$   (10a)
$E_{t} = (1-\eta)\,E_{t-1} + \eta \sum_{k=1}^{d} \bar{F}_{t}^{k} F_{t}^{k}$   (10b)
wherein η is the tracking model learning factor, $F_{t}^{k}$ is the k-th training sample, $\bar{F}_{t}^{k}$ is the complex conjugate of the k-th training sample, $G_{t}$ is the ideal correlation output, $F_{t}^{l}$ is the training sample of the l-th dimension, λ is the regularization term weight factor, t is the frame number, l is the dimension, and d and k index the feature dimensions; updating the scale model means updating $D_{t}^{l}$ and $E_{t}$;
step 305, in the next frame, the response value of the scale correlation filter can be determined by solving formula (11):
$y = \mathcal{F}^{-1}\left\{ \dfrac{\sum_{l=1}^{d} \bar{D}_{t}^{l} Z^{l}}{E_{t} + \lambda} \right\}$   (11)
wherein Z is the set of feature images z;
through the above steps, accurate scale estimation is realized and the adaptability to target scale changes is improved.
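The scale model of steps 301-305 follows a one-dimensional scale correlation filter: patches of size θ^n P × θ^n R are sampled, the filter is learned in the frequency domain, and its numerator D and denominator E are interpolated over time. The numpy sketch below is a simplified illustration consistent with the reconstructed formulas (7)-(11); it treats the scale features as an S x d matrix (one row per scale) and leaves the patch sampling and feature extraction to the caller.

```python
# Simplified sketch of the scale correlation filter of formulas (7)-(11).
import numpy as np

def scale_factors(S, theta):
    """Formula (7): scale factors theta**n for n in {-(S-1)//2, ..., (S-1)//2}."""
    n = np.arange(S) - (S - 1) // 2
    return theta ** n

def ideal_scale_output(S, sigma=1.0):
    """FFT of a 1-D Gaussian centred on the current scale, used as the ideal output g."""
    n = np.arange(S) - (S - 1) // 2
    return np.fft.fft(np.exp(-0.5 * (n / sigma) ** 2))

def new_scale_terms(F, G_hat):
    """Per-frame numerator/denominator terms of formula (9); F is S x d (rows index the scales)."""
    F_hat = np.fft.fft(F, axis=0)
    D = np.conj(G_hat)[:, None] * F_hat               # conj(G) * F^l, one column per feature dim
    E = np.sum(np.conj(F_hat) * F_hat, axis=1).real   # sum_k conj(F^k) * F^k
    return D, E

def update_scale_model(D, E, F, G_hat, eta):
    """Formulas (10a)/(10b): linear interpolation of numerator D and denominator E."""
    D_new, E_new = new_scale_terms(F, G_hat)
    return (1 - eta) * D + eta * D_new, (1 - eta) * E + eta * E_new

def scale_response(D, E, Z, lam):
    """Formula (11): response over the S scales; the argmax gives the estimated scale."""
    Z_hat = np.fft.fft(Z, axis=0)
    num = np.sum(np.conj(D) * Z_hat, axis=1)
    return np.real(np.fft.ifft(num / (E + lam)))
```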
Step 40, when the fusion response value obtained after the response value fusion in the step 20 is less than or equal to the set threshold, re-detecting the target image to obtain a candidate region, and returning to the step 20; when the fusion response value is larger than a set threshold value, updating the position model and the scale model, and then entering step 50; the method specifically comprises the following steps:
step 401, judging whether the fused output response value R_t obtained in step 20 is less than or equal to the fixed threshold Δ; if it is, executing step 402, and if not, executing step 403; only when the fused response value calculated by formula (6) is less than or equal to the fixed threshold does this indicate that the tracking effect is poor or tracking has failed, so that re-detection is needed;
step 402, generating a number of candidate regions C_d over the whole image through the EdgeBox re-detector, then calculating the confidence value of each candidate region, and learning the optimized context-aware filter using a learning rate in order to maintain a long-term memory of appearance changes; obtaining the maximum response value g(c), then selecting the optimal candidate region as the re-detection result by minimizing formula (12), and then returning to step 20;
[Formula (12), shown only as an image in the original: the candidate score combining the maximum response value g(c) of each candidate $c_{t}^{i}$ with the weighted center-distance term α·D; the candidate minimizing this score is selected]
wherein g(c) denotes the maximum response value, α denotes the weight factor, $c_{t}^{i}$ denotes a candidate region in frame t, and D denotes the distance between the center of each $c_{t}^{i}$ and the center of the bounding box $c_{t-1}$;
step 403, updating the position model by using formulas (3a) and (3b), updating the scale model by using formulas (10a) and (10b), and then proceeding to step 50.
EdgeBox is introduced as a re-detector in the tracking process to handle tracking failures and improve tracking robustness.
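The exact form of formula (12) is shown only as an image in the original, so the sketch below assumes a simple score that rewards a high filter response g(c) and penalizes, with weight α, the distance D between a candidate's center and the previous bounding box center; this matches the quantities named in the text but is an assumption, not the patent's formula. The proposal generation and filter evaluation are left to the caller, since EdgeBox and the learned filter are external components.

```python
# Sketch of step 40: threshold test and candidate selection for re-detection.
# The combined score is an assumption consistent with the quantities named for formula (12).
import numpy as np

def centre_distance(box_a, box_b):
    """Distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    return float(np.hypot(ax - bx, ay - by))

def redetect_or_update(fused_peak, delta, candidates, responses, prev_box, alpha):
    """candidates: list of (x, y, w, h) proposals; responses: their maximum filter responses g(c)."""
    if fused_peak > delta:
        return "update_models", prev_box   # step 403: tracking is reliable, keep the current box
    # step 402: prefer candidates with high response and small center displacement
    scores = [alpha * centre_distance(c, prev_box) - g for c, g in zip(candidates, responses)]
    best = int(np.argmin(scores))
    return "redetected", candidates[best]
```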
And step 50, using the updated position model and the updated scale model for tracking the next frame, and returning to the step 40.
Please refer to fig. 3 to 14, which are plots automatically generated by MATLAB software. Figs. 3 to 14 compare, in various ways, the tracking accuracy and tracking success rate of the method of the embodiments of the present specification (the proposed method) with other target tracking methods or algorithms, including CNN-SVM, Staple_CA, C-COT_HOG, SAMF_AT, Staple, CFNet_conv3, SRDCF, LMCF, SiamFC, SAMF_CA, LCT, DSST and KCF. The entries in the legend boxes on the right of figs. 3 to 14 list the methods (or algorithms) from best to worst, top to bottom. As can be seen from figs. 3 to 14, the method of the embodiment of the present specification ranks essentially first on the 100 video sequences of the OTB-2015 data set and has a clear advantage in tracking accuracy and tracking success rate compared with the other methods.
The meaning of the accuracy plots in figs. 3 to 14 is as follows: in the tracking accuracy evaluation, a widely used criterion is the center location error, defined as the average Euclidean distance between the center position of the tracked target and the manually annotated ground-truth position. The accuracy plot shows the percentage of frames, out of the total number of frames, whose estimated position lies within a given distance threshold of the ground truth.
The meaning of the success rate plots in figs. 3 to 14 is as follows: in the success rate evaluation, the criterion is the overlap ratio of bounding boxes. Let the tracked bounding box be γ_t and the ground-truth bounding box be γ_a; the overlap ratio is defined as $S = \lvert \gamma_t \cap \gamma_a \rvert / \lvert \gamma_t \cup \gamma_a \rvert$, where ∩ and ∪ denote the intersection and union of the two regions, respectively, and |·| denotes the number of pixels in a region. To gauge the performance of the algorithm over a series of frames, we count the number of successful frames whose overlap ratio S is greater than a given threshold t_0. The success rate plot gives the proportion of successful frames as the threshold varies from 0 to 1.
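Both evaluation criteria described above can be computed directly from the tracked and ground-truth boxes; a short numpy sketch, with boxes given as (x, y, w, h), is shown below.

```python
# Sketch of the OTB-style precision and success metrics described above.
import numpy as np

def centre_error(box_t, box_a):
    cx_t, cy_t = box_t[0] + box_t[2] / 2, box_t[1] + box_t[3] / 2
    cx_a, cy_a = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    return float(np.hypot(cx_t - cx_a, cy_t - cy_a))

def overlap_ratio(box_t, box_a):
    """S = |intersection| / |union| for axis-aligned boxes."""
    x1, y1 = max(box_t[0], box_a[0]), max(box_t[1], box_a[1])
    x2 = min(box_t[0] + box_t[2], box_a[0] + box_a[2])
    y2 = min(box_t[1] + box_t[3], box_a[1] + box_a[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_t[2] * box_t[3] + box_a[2] * box_a[3] - inter
    return inter / union if union > 0 else 0.0

def precision_at(tracked, truth, dist_threshold=20.0):
    errs = np.array([centre_error(t, a) for t, a in zip(tracked, truth)])
    return float(np.mean(errs <= dist_threshold))

def success_at(tracked, truth, overlap_threshold=0.5):
    ious = np.array([overlap_ratio(t, a) for t, a in zip(tracked, truth)])
    return float(np.mean(ious > overlap_threshold))
```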
In the embodiment of the present specification, a convolutional neural network is used to extract image features, the context-aware correlation filters output the corresponding response values, and the three response values are adaptively fused, so that the position model can predict the position of the target object well; a scale correlation filter is used for fast scale estimation of the target, which improves the ability to handle scale changes to a certain extent and improves the tracking accuracy; and an EdgeBox re-detector is incorporated, so that when tracking fails or drifts the re-detector re-detects the target image, and the position model and the scale model are updated only when the conditions are met.
Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims (2)

1. A target tracking method through hierarchical feature response fusion, comprising:
step 10, initializing parameters;
step 20, extracting the layered features of the target image, and performing response value fusion to obtain a position model, wherein the step comprises the following steps:
step 201, at frame t of the video sequence, taking a region of set size centered on the target point P(x_t, y_t) as the target sample, feeding the target sample into a convolutional neural network, and extracting the layer-3, layer-4 and layer-5 convolution features to obtain the layer-3, layer-4 and layer-5 feature images;
step 202, learning on layers 3, 4 and 5 of the convolutional neural network respectively to obtain three initial context-aware correlation filters;
circularly shifting the layer-3, layer-4 and layer-5 feature images to form training samples, obtaining a data matrix and a desired output, and using them to optimize the initial context-aware filter w' as in formula (1) to obtain the optimized context-aware filter:
[Formula (1), shown only as an image in the original: the regularized least-squares objective, with regularization weight λ1, that refines the initial filter w' into the optimized filter w from the data matrix U_0 and the desired output y]
wherein w is the optimized context-aware filter, w' is the initial context-aware filter, λ1 is the regularization term weight factor, U_0 is the data matrix, and y is the desired output;
step 203, convolving the three optimized context-aware filters with the layer-3, layer-4 and layer-5 feature images, obtaining the response vector of each feature image by using formula (2), and then searching for the position corresponding to the maximum of the response vector, namely the predicted position of the tracking target;
$f(z) = \mathcal{F}^{-1}\left(\hat{w} \odot \hat{z}\right)$   (2)
wherein z is the feature image, w is the optimized context-aware filter, a hat denotes the Fourier transform, $\mathcal{F}^{-1}$ is the inverse Fourier transform, ⊙ is the element-wise (dot) product between matrix elements, and f(z) is the response vector of the feature image;
step 204, updating the position parameters by a linear interpolation method so as to update the position model, the position parameter updates being given by formulas (3a) and (3b):
$\hat{\alpha}^{\,i} = (1-\eta)\,\hat{\alpha}^{\,i-1} + \eta\,\hat{\alpha}$   (3a)
$\hat{x}^{\,i} = (1-\eta)\,\hat{x}^{\,i-1} + \eta\,\hat{x}$   (3b)
wherein i is the sequence number of the current frame, η is the tracking model learning factor, $\hat{\alpha}$ is the closed-form solution of the training sample parameters in the Fourier domain obtained using the properties of the circulant matrix, and $\hat{x}$ is the updated position model of the target sample;
step 205, recording the three output response values obtained by convolving the layer-3, layer-4 and layer-5 feature images with their optimized context-aware filters as R_context3, R_context4 and R_context5, and then normalizing the response value weight of each layer at frame t as in formulas (4a), (4b), (4c):
$\mathrm{context3\_w}_{t} = \dfrac{R_{\mathrm{context3}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4a)
$\mathrm{context4\_w}_{t} = \dfrac{R_{\mathrm{context4}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4b)
$\mathrm{context5\_w}_{t} = \dfrac{R_{\mathrm{context5}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4c)
updating the initial response value weights by using the frame-t response value weights as in formulas (5a), (5b), (5c):
$\mathrm{context3\_w}_{t} = (1-\tau)\,\mathrm{context3\_w}_{t-1} + \tau\,\mathrm{context3\_w}_{t}$   (5a)
$\mathrm{context4\_w}_{t} = (1-\tau)\,\mathrm{context4\_w}_{t-1} + \tau\,\mathrm{context4\_w}_{t}$   (5b)
$\mathrm{context5\_w}_{t} = (1-\tau)\,\mathrm{context5\_w}_{t-1} + \tau\,\mathrm{context5\_w}_{t}$   (5c)
wherein τ is the weight update parameter, and context3_w_t, context4_w_t and context5_w_t denote the response value weights at frame t;
step 206, at frame t, fusing the response values of the layer-3, layer-4 and layer-5 feature images through formula (6) to obtain the fused output response value R_t, obtaining the final position model, and obtaining the position of the tracking target according to the final position model;
$R_{t} = \mathrm{context3\_w}_{t} \cdot R_{\mathrm{context3}} + \mathrm{context4\_w}_{t} \cdot R_{\mathrm{context4}} + \mathrm{context5\_w}_{t} \cdot R_{\mathrm{context5}}$   (6)
wherein context3_w_t, context4_w_t and context5_w_t denote the response value weights at frame t, and R_context3, R_context4 and R_context5 are the output response values obtained by convolving the layer-3, layer-4 and layer-5 feature images with their optimized context-aware filters;
step 30, training the maximum scale response value of the scale correlation filter to obtain a scale model, including:
step 301, setting the sizes of the image patches extracted for scale evaluation around the target to
$\theta^{n} P \times \theta^{n} R, \quad n \in \left\{ -\left\lfloor \tfrac{S-1}{2} \right\rfloor, \dots, \left\lfloor \tfrac{S-1}{2} \right\rfloor \right\}$   (7)
wherein P and R are the width and height of the target sample in the previous frame, θ is the scale increment factor, and S is the number of scale levels;
step 302, minimizing the cost function in formula (8) to obtain the scale correlation filter:
$\varepsilon = \left\| g - \sum_{l=1}^{d} h^{l} \star f^{l} \right\|^{2} + \lambda \sum_{l=1}^{d} \left\| h^{l} \right\|^{2}$   (8)
wherein ε is the cost function, h is the scale correlation filter, g is the ideal correlation output, l denotes the feature dimension, λ is the regularization term weight factor, f is the response vector of the feature image, and d is the number of feature dimensions;
step 303, solving formula (8) in the frequency domain, as in formula (9), for estimating the target scale:
$H^{l} = \dfrac{\bar{G}\,F^{l}}{\sum_{k=1}^{d} \bar{F}^{k} F^{k} + \lambda}$   (9)
wherein H is the scale correlation filter in the frequency domain, l denotes the feature dimension, H^l is the scale correlation filter in dimension l, F^k is the k-th training sample, F^l is the training sample of the l-th dimension, G is the ideal correlation output, $\bar{G}$ is the complex conjugate of the ideal correlation output, $\bar{F}^{k}$ is the complex conjugate of the k-th training sample, λ is the regularization term weight factor, t is the frame number, and d and k index the feature dimensions;
step 304, to obtain a robust result, updating the numerator and the denominator of H^l in formula (9) separately so as to update the scale model:
$D_{t}^{l} = (1-\eta)\,D_{t-1}^{l} + \eta\,\bar{G}_{t} F_{t}^{l}$   (10a)
$E_{t} = (1-\eta)\,E_{t-1} + \eta \sum_{k=1}^{d} \bar{F}_{t}^{k} F_{t}^{k}$   (10b)
wherein η is the tracking model learning factor, $F_{t}^{k}$ is the k-th training sample, $\bar{F}_{t}^{k}$ is the complex conjugate of the k-th training sample, $G_{t}$ is the ideal correlation output, $F_{t}^{l}$ is the training sample of the l-th dimension, λ is the regularization term weight factor, t is the frame number, l is the dimension, and d and k index the feature dimensions; updating the scale model means updating $D_{t}^{l}$ and $E_{t}$;
step 305, in the next frame, the response value of the scale correlation filter can be determined by solving formula (11):
$y = \mathcal{F}^{-1}\left\{ \dfrac{\sum_{l=1}^{d} \bar{D}_{t}^{l} Z^{l}}{E_{t} + \lambda} \right\}$   (11)
wherein Z is the set of feature images z;
step 40, when the fusion response value obtained after the response value fusion in the step 20 is less than or equal to the set threshold, re-detecting the target image to obtain a candidate region, and returning to the step 20; when the fusion response value is larger than a set threshold value, updating the position model and the scale model, and then entering step 50; the method comprises the following steps:
step 401, judging whether the fused output response value R_t obtained in step 20 is less than or equal to the fixed threshold Δ; if it is, executing step 402, and if not, executing step 403;
step 402, generating a number of candidate regions C_d over the whole image through the EdgeBox re-detector, then calculating the confidence value of each candidate region, learning the optimized context-aware filter using a learning rate, obtaining the maximum response value g(c), then selecting the optimal candidate region as the re-detection result by minimizing formula (12), and then returning to step 20,
[Formula (12), shown only as an image in the original: the candidate score combining the maximum response value g(c) of each candidate $c_{t}^{i}$ with the weighted center-distance term α·D; the candidate minimizing this score is selected]
wherein g(c) denotes the maximum response value, α denotes the weight factor, $c_{t}^{i}$ denotes a candidate region in frame t, and D denotes the distance between the center of each $c_{t}^{i}$ and the center of the bounding box $c_{t-1}$;
step 403, updating the position model by using formulas (3a) and (3b), updating the scale model by using formulas (10a) and (10b), and then proceeding to step 50;
and step 50, using the updated position model and the updated scale model for tracking the next frame, and returning to the step 40.
2. The target tracking method through hierarchical feature response fusion according to claim 1, wherein in step 10 the parameters include: the hierarchical correlation filters {W_t^l | l = 3, 4, 5}, the filter regularization term weight factors λ and λ1, the tracking model learning factor η, the number of scale levels S, the scale increment factor θ, the weight update parameter τ, the fixed threshold Δ and the weight factor α.
CN201911250349.XA 2019-12-09 2019-12-09 Target tracking method through hierarchical feature response fusion Active CN111008996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911250349.XA CN111008996B (en) 2019-12-09 2019-12-09 Target tracking method through hierarchical feature response fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911250349.XA CN111008996B (en) 2019-12-09 2019-12-09 Target tracking method through hierarchical feature response fusion

Publications (2)

Publication Number Publication Date
CN111008996A CN111008996A (en) 2020-04-14
CN111008996B true CN111008996B (en) 2023-04-07

Family

ID=70115126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911250349.XA Active CN111008996B (en) 2019-12-09 2019-12-09 Target tracking method through hierarchical feature response fusion

Country Status (1)

Country Link
CN (1) CN111008996B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111640138B (en) * 2020-05-28 2023-10-27 济南博观智能科技有限公司 Target tracking method, device, equipment and storage medium
CN111612001B (en) * 2020-05-28 2023-04-07 华侨大学 Target tracking and positioning method based on feature fusion
CN111968156A (en) * 2020-07-28 2020-11-20 国网福建省电力有限公司 Adaptive hyper-feature fusion visual tracking method
CN113537001B (en) * 2021-07-02 2023-06-23 安阳工学院 Vehicle driving autonomous decision-making method and device based on visual target tracking
CN113610891B (en) * 2021-07-14 2023-05-23 桂林电子科技大学 Target tracking method, device, storage medium and computer equipment
CN113537241B (en) * 2021-07-16 2022-11-08 重庆邮电大学 Long-term correlation filtering target tracking method based on adaptive feature fusion

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015163830A1 (en) * 2014-04-22 2015-10-29 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi Target localization and size estimation via multiple model learning in visual tracking
CN108549839A (en) * 2018-03-13 2018-09-18 华侨大学 The multiple dimensioned correlation filtering visual tracking method of self-adaptive features fusion
CN109325966A (en) * 2018-09-05 2019-02-12 华侨大学 A method of vision tracking is carried out by space-time context
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
常敏 et al. Correlation filter tracking with adaptive feature fusion and model update. Acta Optica Sinica (光学学报), 2019, full text. *
陈智 et al. Multi-scale correlation filter target tracking algorithm with adaptive feature fusion. Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报), 2018, full text. *

Also Published As

Publication number Publication date
CN111008996A (en) 2020-04-14

Similar Documents

Publication Publication Date Title
CN111008996B (en) Target tracking method through hierarchical feature response fusion
CN111354017B (en) Target tracking method based on twin neural network and parallel attention module
CN108549839B (en) Adaptive feature fusion multi-scale correlation filtering visual tracking method
CN108830285B (en) Target detection method for reinforcement learning based on fast-RCNN
CN109741366B (en) Related filtering target tracking method fusing multilayer convolution characteristics
CN107424177B (en) Positioning correction long-range tracking method based on continuous correlation filter
Kwon et al. Highly nonrigid object tracking via patch-based dynamic appearance modeling
CN110135500B (en) Target tracking method under multiple scenes based on self-adaptive depth characteristic filter
CN110472594B (en) Target tracking method, information insertion method and equipment
CN111260738A (en) Multi-scale target tracking method based on relevant filtering and self-adaptive feature fusion
CN111260688A (en) Twin double-path target tracking method
CN111310582A (en) Turbulence degradation image semantic segmentation method based on boundary perception and counterstudy
CN109035300B (en) Target tracking method based on depth feature and average peak correlation energy
CN110276784B (en) Correlation filtering moving target tracking method based on memory mechanism and convolution characteristics
CN110909591B (en) Self-adaptive non-maximum suppression processing method for pedestrian image detection by using coding vector
CN109325966B (en) Method for carrying out visual tracking through space-time context
CN109308713B (en) Improved nuclear correlation filtering underwater target tracking method based on forward-looking sonar
CN109903315B (en) Method, apparatus, device and readable storage medium for optical flow prediction
CN110660080A (en) Multi-scale target tracking method based on learning rate adjustment and fusion of multilayer convolution features
CN110992288B (en) Video image blind denoising method used in mine shaft environment
CN110992401A (en) Target tracking method and device, computer equipment and storage medium
CN111583311A (en) PCBA rapid image matching method
CN111340842A (en) Correlation filtering target tracking algorithm based on joint model
CN113052873A (en) Single-target tracking method for on-line self-supervision learning scene adaptation
CN107657627B (en) Space-time context target tracking method based on human brain memory mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant