CN111008996B - Target tracking method through hierarchical feature response fusion - Google Patents


Info

Publication number
CN111008996B
CN111008996B (application CN201911250349.XA; also published as CN111008996A)
Authority
CN
China
Prior art keywords
scale
model
tracking
target
response value
Prior art date
Legal status
Active
Application number
CN201911250349.XA
Other languages
Chinese (zh)
Other versions
CN111008996A (en)
Inventor
柳培忠
邓建华
张万程
杜永兆
陈智
吴奕红
杨建兰
Current Assignee
Quanzhou Zhongfang Hongye Information Technology Co ltd
Huaqiao University
Original Assignee
Quanzhou Zhongfang Hongye Information Technology Co ltd
Huaqiao University
Priority date
Filing date
Publication date
Application filed by Quanzhou Zhongfang Hongye Information Technology Co., Ltd. and Huaqiao University
Priority to CN201911250349.XA
Publication of CN111008996A
Application granted
Publication of CN111008996B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G06T 7/246: Physics; Computing, calculating or counting; Image data processing or generation, in general; Image analysis; Analysis of motion; Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F 18/25: Physics; Computing, calculating or counting; Electric digital data processing; Pattern recognition; Analysing; Fusion techniques
    • G06T 2207/10016: Indexing scheme for image analysis or image enhancement; Image acquisition modality; Video, image sequence
    • G06T 2207/20004: Indexing scheme for image analysis or image enhancement; Special algorithmic details; Adaptive image processing
    • G06T 2207/20081: Indexing scheme for image analysis or image enhancement; Special algorithmic details; Training, learning
    • G06T 2207/20084: Indexing scheme for image analysis or image enhancement; Special algorithmic details; Artificial neural networks [ANN]
    • Y02T 10/40: General tagging of cross-sectional technologies; Technologies or applications for mitigation or adaptation against climate change; Climate change mitigation technologies related to transportation; Road transport of goods or passengers; Internal combustion engine [ICE] based vehicles; Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method through hierarchical feature response fusion, and relates to the field of computer vision target tracking. The method comprises the following steps: step 10, initializing parameters; step 20, extracting the hierarchical features of the target image and performing response value fusion to obtain a position model; step 30, training on the maximum scale response value of a scale correlation filter to obtain a scale model; step 40, when the fused response value obtained from the response fusion of step 20 is less than or equal to a set threshold, re-detecting the target image to obtain candidate regions and returning to step 20, and when the fused response value is greater than the set threshold, updating the position model and the scale model and then entering step 50; and step 50, using the updated position model and the updated scale model to track the next frame, and returning to step 40. By changing the conditions for the adaptive fusion of hierarchical features and for model updating, the method improves the tracking accuracy of the correlation filter and makes the tracking effect more satisfactory.

Description

Target tracking method through hierarchical feature response fusion
Technical Field
The invention relates to the field of computer vision target tracking, in particular to a target tracking method through hierarchical feature response fusion.
Background
Visual target tracking is a basic task in the field of computer vision, with very wide applications in automatic driving, robotics, video surveillance, human-computer interaction and the like. Even though the initial frame of the target is given, predicting the position of the target in the subsequent video frames with an efficient method remains a challenge. Despite the great advances made in recent years, tracking is still challenged by many external factors. For example, the target typically experiences disturbances such as occlusion, background blur, fast motion, illumination changes, deformation, scale changes and leaving the field of view, all of which affect the accuracy and robustness of target tracking.
Currently, the two major directions of target tracking are correlation-filter-based methods and deep-learning-based methods. Correlation filter methods benefit from fast operations and therefore track very quickly, but their tracking accuracy is too low, so the tracking effect is not ideal. Deep learning methods can generate rich target image features and distinguish the target well during tracking, which greatly improves the tracking accuracy, but because of feature extraction and other processes the computational load is huge and the tracking speed cannot reach real time. Although the prior art achieves the expected tracking results, various interferences are inevitably encountered during tracking; these introduce erroneous background information that is passed on to the next frame, and its long-term accumulation degrades the quality of the tracking model, eventually causing tracking-target drift or tracking failure. It is therefore important to decide when to update the position model and the scale model.
Based on the above, the present inventors have further explored and studied it, and have proposed a target tracking method by hierarchical feature response fusion.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a target tracking method through hierarchical feature response fusion which, by changing the conditions for the adaptive fusion of hierarchical features and for model updating, solves the problems of low tracking accuracy and unsatisfactory tracking effect of correlation filters.
The technical problem to be solved by the invention is realized as follows:
a method of target tracking through hierarchical feature response fusion, comprising:
step 10, initializing parameters;
step 20, extracting the layered features of the target image, and performing response value fusion to obtain a position model;
step 30, training the maximum scale response value of the scale correlation filter to obtain a scale model;
step 40, when the fusion response value obtained after the response value fusion in the step 20 is less than or equal to the set threshold, re-detecting the target image to obtain a candidate region, and returning to the step 20; when the fusion response value is larger than a set threshold value, updating the position model and the scale model, and then entering step 50;
and step 50, using the updated position model and the updated scale model for tracking the next frame, and returning to the step 40.
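To make the control flow of steps 10-50 concrete, the following Python sketch wires the five steps into a single loop. It is only a skeleton under stated assumptions: locate, estimate_scale and redetect are hypothetical callables standing in for steps 20, 30 and 402 (they are not names from the patent) and are passed in as parameters, and params["delta"] is the fixed threshold of step 40.

```python
# Skeleton of the steps 10-50 loop; the three callables are hypothetical stand-ins, not patent APIs.
def track(frames, init_box, params, locate, estimate_scale, redetect):
    pos_model, scale_model, box = None, None, init_box
    results = []
    for frame in frames:
        # step 20: hierarchical features + response fusion -> position, fused response, candidate model
        box, fused_peak, pos_new = locate(frame, box, pos_model, params)
        # step 30: scale correlation filter -> scale estimate and candidate scale model
        box, scale_new = estimate_scale(frame, box, scale_model, params)
        if fused_peak <= params["delta"]:
            # step 40 (low confidence): re-detect candidate regions and redo step 20
            box = redetect(frame, box, pos_model, params)
            box, fused_peak, pos_new = locate(frame, box, pos_model, params)
        else:
            # step 40 (high confidence): accept the model updates
            pos_model, scale_model = pos_new, scale_new
        results.append(box)  # step 50: the updated models are used for the next frame
    return results
```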
Further, in step 10, the parameters include: the hierarchical correlation filters {W_t^l | l = 3, 4, 5}, the filter regularization term weight factors λ and λ1, the tracking model learning factor η, the number of scale levels S, the scale increment factor θ, the weight update parameter τ, the fixed threshold Δ and the weight factor α.
Further, the step 20 includes:
step 201, at frame t of the video sequence, taking a region of set size centered on the target point P(x_t, y_t) as the target sample, feeding the target sample into a convolutional neural network, and extracting the layer-3, layer-4 and layer-5 convolution features to obtain the layer-3, layer-4 and layer-5 feature images;
step 202, learning on layers 3, 4 and 5 of the convolutional neural network respectively to obtain three initial context-aware correlation filters;
circularly shifting the layer-3, layer-4 and layer-5 feature images to form training samples, obtaining a data matrix and a desired output, and using them to optimize the initial context-aware filter w' as in formula (1) to obtain the optimized context-aware filter:
[Formula (1), shown only as an image in the original: the regularized least-squares objective, with regularization weight λ1, that refines the initial filter w' into the optimized filter w from the data matrix U_0 and the desired output y]
wherein w is the optimized context-aware filter, w' is the initial context-aware filter, λ1 is the regularization term weight factor, U_0 is the data matrix, and y is the desired output;
step 203, convolving the three optimized context-aware filters with the layer-3, layer-4 and layer-5 feature images, obtaining the response vector of each feature image by using formula (2), and then searching for the position corresponding to the maximum of the response vector, namely the predicted position of the tracking target;
$f(z) = \mathcal{F}^{-1}\left(\hat{w} \odot \hat{z}\right)$   (2)
wherein z is the feature image, w is the optimized context-aware filter, a hat denotes the Fourier transform, $\mathcal{F}^{-1}$ is the inverse Fourier transform, ⊙ is the element-wise (dot) product between matrix elements, and f(z) is the response vector of the feature image;
step 204, updating the position parameters by a linear interpolation method so as to update the position model, the position parameter updates being given by formulas (3a) and (3b):
$\hat{\alpha}^{\,i} = (1-\eta)\,\hat{\alpha}^{\,i-1} + \eta\,\hat{\alpha}$   (3a)
$\hat{x}^{\,i} = (1-\eta)\,\hat{x}^{\,i-1} + \eta\,\hat{x}$   (3b)
wherein i is the sequence number of the current frame, η is the tracking model learning factor, $\hat{\alpha}$ is the closed-form solution of the training sample parameters in the Fourier domain obtained using the properties of the circulant matrix, and $\hat{x}$ is the updated position model of the target sample;
step 205, recording the three output response values obtained by convolving the layer-3, layer-4 and layer-5 feature images with their optimized context-aware filters as R_context3, R_context4 and R_context5, and then normalizing the response value weight of each layer at frame t as in formulas (4a), (4b), (4c):
$\mathrm{context3\_w}_{t} = \dfrac{R_{\mathrm{context3}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4a)
$\mathrm{context4\_w}_{t} = \dfrac{R_{\mathrm{context4}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4b)
$\mathrm{context5\_w}_{t} = \dfrac{R_{\mathrm{context5}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4c)
updating the initial response value weights by using the frame-t response value weights as in formulas (5a), (5b), (5c):
$\mathrm{context3\_w}_{t} = (1-\tau)\,\mathrm{context3\_w}_{t-1} + \tau\,\mathrm{context3\_w}_{t}$   (5a)
$\mathrm{context4\_w}_{t} = (1-\tau)\,\mathrm{context4\_w}_{t-1} + \tau\,\mathrm{context4\_w}_{t}$   (5b)
$\mathrm{context5\_w}_{t} = (1-\tau)\,\mathrm{context5\_w}_{t-1} + \tau\,\mathrm{context5\_w}_{t}$   (5c)
wherein τ is the weight update parameter, and context3_w_t, context4_w_t and context5_w_t denote the response value weights at frame t;
step 206, at frame t, fusing the response values of the layer-3, layer-4 and layer-5 feature images through formula (6) to obtain the fused output response value R_t, obtaining the final position model, and obtaining the position of the tracking target according to the final position model;
$R_{t} = \mathrm{context3\_w}_{t} \cdot R_{\mathrm{context3}} + \mathrm{context4\_w}_{t} \cdot R_{\mathrm{context4}} + \mathrm{context5\_w}_{t} \cdot R_{\mathrm{context5}}$   (6)
wherein context3_w_t, context4_w_t and context5_w_t denote the response value weights at frame t, and R_context3, R_context4 and R_context5 are the output response values obtained by convolving the layer-3, layer-4 and layer-5 feature images with their optimized context-aware filters.
Further, the step 30 includes:
step 301, setting the sizes of the image patches extracted for scale evaluation around the target to
$\theta^{n} P \times \theta^{n} R, \quad n \in \left\{ -\left\lfloor \tfrac{S-1}{2} \right\rfloor, \dots, \left\lfloor \tfrac{S-1}{2} \right\rfloor \right\}$   (7)
wherein P and R are the width and height of the target sample in the previous frame, θ is the scale increment factor, and S is the number of scale levels;
step 302, minimizing the cost function in formula (8) to obtain the scale correlation filter:
$\varepsilon = \left\| g - \sum_{l=1}^{d} h^{l} \star f^{l} \right\|^{2} + \lambda \sum_{l=1}^{d} \left\| h^{l} \right\|^{2}$   (8)
wherein ε is the cost function, h is the scale correlation filter, g is the ideal correlation output, l denotes the feature dimension, λ is the regularization term weight factor, f is the response vector of the feature image, and d is the number of feature dimensions;
step 303, solving formula (8) in the frequency domain, as in formula (9), for estimating the target scale:
$H^{l} = \dfrac{\bar{G}\,F^{l}}{\sum_{k=1}^{d} \bar{F}^{k} F^{k} + \lambda}$   (9)
wherein H is the scale correlation filter in the frequency domain, l denotes the feature dimension, H^l is the scale correlation filter in dimension l, F^k is the k-th training sample, F^l is the training sample of the l-th dimension, G is the ideal correlation output, $\bar{G}$ is the complex conjugate of the ideal correlation output, $\bar{F}^{k}$ is the complex conjugate of the k-th training sample, λ is the regularization term weight factor, t is the frame number, and d and k index the feature dimensions;
step 304, to obtain a robust result, updating the numerator and the denominator of H^l in formula (9) separately so as to update the scale model:
$D_{t}^{l} = (1-\eta)\,D_{t-1}^{l} + \eta\,\bar{G}_{t} F_{t}^{l}$   (10a)
$E_{t} = (1-\eta)\,E_{t-1} + \eta \sum_{k=1}^{d} \bar{F}_{t}^{k} F_{t}^{k}$   (10b)
wherein η is the tracking model learning factor, $F_{t}^{k}$ is the k-th training sample, $\bar{F}_{t}^{k}$ is the complex conjugate of the k-th training sample, $G_{t}$ is the ideal correlation output, $F_{t}^{l}$ is the training sample of the l-th dimension, λ is the regularization term weight factor, t is the frame number, l is the dimension, and d and k index the feature dimensions; updating the scale model means updating $D_{t}^{l}$ and $E_{t}$;
step 305, in the next frame, the response value of the scale correlation filter can be determined by solving formula (11):
$y = \mathcal{F}^{-1}\left\{ \dfrac{\sum_{l=1}^{d} \bar{D}_{t}^{l} Z^{l}}{E_{t} + \lambda} \right\}$   (11)
wherein Z is the set of feature images z.
Further, the step 40 specifically includes:
step 401, judging whether the fused output response value R_t obtained in step 20 is less than or equal to the fixed threshold Δ; if it is, executing step 402, and if not, executing step 403;
step 402, generating a number of candidate regions C_d over the whole image through the EdgeBox re-detector, then calculating the confidence value of each candidate region, learning the optimized context-aware filter using a learning rate, obtaining the maximum response value g(c), then selecting the optimal candidate region as the re-detection result by minimizing formula (12), and then returning to step 20,
[Formula (12), shown only as an image in the original: the candidate score combining the maximum response value g(c) of each candidate $c_{t}^{i}$ with the weighted center-distance term α·D; the candidate minimizing this score is selected]
wherein g(c) denotes the maximum response value, α denotes the weight factor, $c_{t}^{i}$ denotes a candidate region in frame t, and D denotes the distance between the center of each $c_{t}^{i}$ and the center of the bounding box $c_{t-1}$;
step 403, updating the position model by using formulas (3a) and (3b), updating the scale model by using formulas (10a) and (10b), and then proceeding to step 50.
The invention has the following advantages:
1. A convolutional neural network is used to extract image features, context-aware correlation filters output the corresponding response values, and the three response values are adaptively fused, so that the position model can predict the position of the target object well;
2. A scale correlation filter is used to perform fast scale estimation of the target, which improves the ability to handle scale changes to a certain extent and improves the tracking accuracy;
3. An EdgeBox re-detector is incorporated: when tracking fails or drifts, the re-detector re-detects the target image, and the position model and the scale model are updated only when the conditions are met.
Drawings
The invention will be further described with reference to the following examples and figures.
FIG. 1 is a flowchart of an implementation of the target tracking method through hierarchical feature response fusion according to an embodiment of the present specification;
FIG. 2 is a schematic flowchart of the target tracking method through hierarchical feature response fusion according to an embodiment of the present specification;
FIG. 3 is the tracking accuracy plot of the embodiment of the present specification on the 100 video sequences of the OTB-2015 data set;
FIG. 4 is the success rate plot of the embodiment of the present specification on the 100 video sequences of the OTB-2015 data set;
FIG. 5 is the tracking accuracy plot for the illumination variation attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 6 is the success rate plot for the illumination variation attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 7 is the tracking accuracy plot for the scale variation attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 8 is the success rate plot for the scale variation attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 9 is the tracking accuracy plot for the in-plane rotation attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 10 is the success rate plot for the in-plane rotation attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 11 is the tracking accuracy plot for the background blur attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 12 is the success rate plot for the background blur attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 13 is the tracking accuracy plot for the occlusion attribute on the 100 video sequences of the OTB-2015 data set;
FIG. 14 is the success rate plot for the occlusion attribute on the 100 video sequences of the OTB-2015 data set.
Detailed Description
Referring to fig. 1 and fig. 2, a target tracking method through hierarchical feature response fusion provided in an embodiment of the present specification may include the following steps:
step 10, initializing parameters, wherein the parameters include: the hierarchical correlation filters {W_t^l | l = 3, 4, 5}, the filter regularization term weight factors λ and λ1, the tracking model learning factor η, the number of scale levels S, the scale increment factor θ, the weight update parameter τ, the fixed threshold Δ and the weight factor α;
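As a concrete illustration of step 10, the parameters can be gathered into a single configuration object. The sketch below is a minimal Python example; the numeric values are illustrative assumptions and are not specified by the patent.

```python
# Illustrative parameter set for step 10; the numeric values are assumptions, not from the patent.
params = {
    "layers": [3, 4, 5],   # CNN layers whose features drive the hierarchical filters W_t^l
    "lambda_": 1e-4,       # filter regularization weight lambda
    "lambda1": 1e-4,       # context regularization weight lambda1
    "eta": 0.01,           # tracking model learning factor eta
    "S": 33,               # number of scale levels
    "theta": 1.02,         # scale increment factor theta
    "tau": 0.1,            # weight update parameter tau
    "delta": 0.2,          # fixed threshold for triggering re-detection
    "alpha": 0.1,          # weight factor alpha in the re-detection score
}
```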
step 20, extracting the hierarchical characteristics of the target image, and performing response value fusion to obtain a position model; the method specifically comprises the following steps:
step 201, at frame t of the video sequence, taking a region of set size centered on the target point P(x_t, y_t) as the target sample, feeding the target sample into a convolutional neural network, and extracting the layer-3, layer-4 and layer-5 convolution features to obtain the layer-3, layer-4 and layer-5 feature images;
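The patent does not name the particular convolutional network; a common choice for hierarchical-feature trackers is VGG-19, whose conv3, conv4 and conv5 blocks supply the layer-3, layer-4 and layer-5 feature images. The sketch below, written under that assumption with PyTorch/torchvision (version 0.13 or later assumed for the weights argument), extracts the three feature maps from a target patch.

```python
# Sketch: extract layer-3/4/5 feature images from a target patch (assumes a VGG-19 backbone).
import torch
import torchvision.models as models
import torchvision.transforms as T

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
LAYER_IDS = {17: 3, 26: 4, 35: 5}   # torchvision index of the last ReLU in conv blocks 3, 4, 5

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_hierarchical_features(patch_rgb):
    """Return {3: fmap3, 4: fmap4, 5: fmap5} for an HxWx3 uint8 target patch."""
    x = preprocess(patch_rgb).unsqueeze(0)
    feats = {}
    with torch.no_grad():
        for idx, layer in enumerate(vgg):
            x = layer(x)
            if idx in LAYER_IDS:
                feats[LAYER_IDS[idx]] = x.squeeze(0)   # C x H x W feature image
            if len(feats) == len(LAYER_IDS):
                break
    return feats
```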
step 202, learning on layers 3, 4 and 5 of the convolutional neural network respectively to obtain three initial context-aware correlation filters;
circularly shifting the layer-3, layer-4 and layer-5 feature images to form training samples, obtaining a data matrix and a desired output, and using them to optimize the initial context-aware filter w' as in formula (1) to obtain the optimized context-aware filter:
[Formula (1), shown only as an image in the original: the regularized least-squares objective, with regularization weight λ1, that refines the initial filter w' into the optimized filter w from the data matrix U_0 and the desired output y]
wherein w is the optimized context-aware filter, w' is the initial context-aware filter, λ1 is the regularization term weight factor, U_0 is the data matrix, and y is the desired output;
the resulting optimized context-aware filter has a high response to the target image patch and a near-zero response to the context image patches;
step 203, convolving the three optimized context-aware filters with the layer-3, layer-4 and layer-5 feature images, obtaining the response vector of each feature image by using formula (2), and then searching for the position corresponding to the maximum of the response vector, namely the predicted position of the tracking target;
$f(z) = \mathcal{F}^{-1}\left(\hat{w} \odot \hat{z}\right)$   (2)
wherein z is the feature image, w is the optimized context-aware filter, a hat denotes the Fourier transform, $\mathcal{F}^{-1}$ is the inverse Fourier transform, ⊙ is the element-wise (dot) product between matrix elements, and f(z) is the response vector of the feature image;
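A minimal numpy sketch of formula (2) as reconstructed above: the filter is applied in the Fourier domain, the element-wise product is transformed back to the spatial domain, and the peak of the response map gives the predicted target position. The filter w_hat is assumed to be given already in the Fourier domain, and the multi-channel summation over feature channels is omitted for brevity.

```python
# Sketch of formula (2): response map and peak location for one feature channel.
import numpy as np

def correlation_response(z, w_hat):
    """z: HxW feature image (spatial domain); w_hat: HxW filter in the Fourier domain."""
    z_hat = np.fft.fft2(z)
    return np.real(np.fft.ifft2(w_hat * z_hat))   # F^{-1}(w_hat elementwise z_hat)

def predicted_position(response):
    """Row/column index of the maximum of the response map, i.e. the predicted target position."""
    return np.unravel_index(np.argmax(response), response.shape)
```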
step 204, updating the position parameters by a linear interpolation method so as to update the position model, the position parameter updates being given by formulas (3a) and (3b):
$\hat{\alpha}^{\,i} = (1-\eta)\,\hat{\alpha}^{\,i-1} + \eta\,\hat{\alpha}$   (3a)
$\hat{x}^{\,i} = (1-\eta)\,\hat{x}^{\,i-1} + \eta\,\hat{x}$   (3b)
wherein i is the sequence number of the current frame, η is the tracking model learning factor, $\hat{\alpha}$ is the closed-form solution of the training sample parameters in the Fourier domain obtained using the properties of the circulant matrix, and $\hat{x}$ is the updated position model of the target sample; updating the position model means updating $\hat{\alpha}$ and $\hat{x}$;
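Formulas (3a) and (3b) are plain linear interpolation with learning factor η, applied to both the Fourier-domain filter parameters and the target appearance model. A one-line numpy helper is sketched below; the variable names follow the reconstruction above and are assumptions rather than names from the patent.

```python
# Sketch of the linear-interpolation update of formulas (3a)/(3b).
import numpy as np

def interp_update(old, new, eta):
    """(1 - eta) * old + eta * new, used for both alpha_hat (3a) and x_hat (3b)."""
    return (1.0 - eta) * old + eta * new

# usage: alpha_hat_model = interp_update(alpha_hat_model, alpha_hat_frame, eta)
#        x_hat_model     = interp_update(x_hat_model,     x_hat_frame,     eta)
```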
step 205, recording the three output response values obtained by convolving the layer-3, layer-4 and layer-5 feature images with their optimized context-aware filters as R_context3, R_context4 and R_context5, and then normalizing the response value weight of each layer at frame t as in formulas (4a), (4b), (4c):
$\mathrm{context3\_w}_{t} = \dfrac{R_{\mathrm{context3}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4a)
$\mathrm{context4\_w}_{t} = \dfrac{R_{\mathrm{context4}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4b)
$\mathrm{context5\_w}_{t} = \dfrac{R_{\mathrm{context5}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4c)
a larger filter response value thus takes a larger proportion and is assigned a higher weight; the initial response value weights are updated by using the frame-t response value weights as in formulas (5a), (5b), (5c):
$\mathrm{context3\_w}_{t} = (1-\tau)\,\mathrm{context3\_w}_{t-1} + \tau\,\mathrm{context3\_w}_{t}$   (5a)
$\mathrm{context4\_w}_{t} = (1-\tau)\,\mathrm{context4\_w}_{t-1} + \tau\,\mathrm{context4\_w}_{t}$   (5b)
$\mathrm{context5\_w}_{t} = (1-\tau)\,\mathrm{context5\_w}_{t-1} + \tau\,\mathrm{context5\_w}_{t}$   (5c)
wherein τ is the weight update parameter, and context3_w_t, context4_w_t and context5_w_t denote the response value weights at frame t;
step 206, at frame t, fusing the response values of the layer-3, layer-4 and layer-5 feature images through formula (6) to obtain the fused output response value R_t, obtaining the final position model, and obtaining the position of the tracking target according to the final position model;
$R_{t} = \mathrm{context3\_w}_{t} \cdot R_{\mathrm{context3}} + \mathrm{context4\_w}_{t} \cdot R_{\mathrm{context4}} + \mathrm{context5\_w}_{t} \cdot R_{\mathrm{context5}}$   (6)
wherein context3_w_t, context4_w_t and context5_w_t denote the response value weights at frame t, and R_context3, R_context4 and R_context5 are the output response values obtained by convolving the layer-3, layer-4 and layer-5 feature images with their optimized context-aware filters.
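The weight normalization of formulas (4a)-(4c), the τ-smoothed weight update of formulas (5a)-(5c) and the weighted fusion of formula (6), as reconstructed above, amount to the following numpy sketch. It assumes each layer's scalar confidence is taken as the peak of its response map and that the three response maps have been resized to a common size; both are assumptions, not details fixed by the patent text.

```python
# Sketch of steps 205-206: adaptive fusion of the layer-3/4/5 response maps.
import numpy as np

def fuse_responses(resp_maps, prev_weights, tau):
    """resp_maps: {3: R3, 4: R4, 5: R5} (same-sized arrays); prev_weights: previous-frame weights."""
    peaks = {l: float(np.max(r)) for l, r in resp_maps.items()}
    total = sum(peaks.values())
    norm_w = {l: p / total for l, p in peaks.items()}              # formulas (4a)-(4c)
    weights = {l: (1.0 - tau) * prev_weights[l] + tau * norm_w[l]  # formulas (5a)-(5c)
               for l in resp_maps}
    fused = sum(weights[l] * resp_maps[l] for l in resp_maps)      # formula (6)
    return fused, weights
```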
Step 30, training the maximum scale response value of the scale correlation filter to obtain a scale model; the method specifically comprises the following steps:
step 301, setting the sizes of the image patches extracted for scale evaluation around the target to
$\theta^{n} P \times \theta^{n} R, \quad n \in \left\{ -\left\lfloor \tfrac{S-1}{2} \right\rfloor, \dots, \left\lfloor \tfrac{S-1}{2} \right\rfloor \right\}$   (7)
wherein P and R are the width and height of the target sample in the previous frame, θ is the scale increment factor, and S is the number of scale levels;
step 302, minimizing the cost function in formula (8) to obtain the scale correlation filter:
$\varepsilon = \left\| g - \sum_{l=1}^{d} h^{l} \star f^{l} \right\|^{2} + \lambda \sum_{l=1}^{d} \left\| h^{l} \right\|^{2}$   (8)
wherein ε is the cost function, h is the scale correlation filter, g is the ideal correlation output, l denotes the feature dimension, λ is the regularization term weight factor, f is the response vector of the feature image, and d is the number of feature dimensions;
step 303, solving formula (8) in the frequency domain, as in formula (9), for estimating the target scale:
$H^{l} = \dfrac{\bar{G}\,F^{l}}{\sum_{k=1}^{d} \bar{F}^{k} F^{k} + \lambda}$   (9)
wherein H is the scale correlation filter in the frequency domain, l denotes the feature dimension, H^l is the scale correlation filter in dimension l, F^k is the k-th training sample, F^l is the training sample of the l-th dimension, G is the ideal correlation output, $\bar{G}$ is the complex conjugate of the ideal correlation output, $\bar{F}^{k}$ is the complex conjugate of the k-th training sample, λ is the regularization term weight factor, t is the frame number, and d and k index the feature dimensions;
step 304, to obtain a robust result, updating the numerator and the denominator of H^l in formula (9) separately so as to update the scale model:
$D_{t}^{l} = (1-\eta)\,D_{t-1}^{l} + \eta\,\bar{G}_{t} F_{t}^{l}$   (10a)
$E_{t} = (1-\eta)\,E_{t-1} + \eta \sum_{k=1}^{d} \bar{F}_{t}^{k} F_{t}^{k}$   (10b)
wherein η is the tracking model learning factor, $F_{t}^{k}$ is the k-th training sample, $\bar{F}_{t}^{k}$ is the complex conjugate of the k-th training sample, $G_{t}$ is the ideal correlation output, $F_{t}^{l}$ is the training sample of the l-th dimension, λ is the regularization term weight factor, t is the frame number, l is the dimension, and d and k index the feature dimensions; updating the scale model means updating $D_{t}^{l}$ and $E_{t}$;
step 305, in the next frame, the response value of the scale correlation filter can be determined by solving formula (11):
$y = \mathcal{F}^{-1}\left\{ \dfrac{\sum_{l=1}^{d} \bar{D}_{t}^{l} Z^{l}}{E_{t} + \lambda} \right\}$   (11)
wherein Z is the set of feature images z;
through the above steps, accurate scale estimation is realized and the adaptability to target scale changes is improved.
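The scale model of steps 301-305 follows a one-dimensional scale correlation filter: patches of size θ^n P × θ^n R are sampled, the filter is learned in the frequency domain, and its numerator D and denominator E are interpolated over time. The numpy sketch below is a simplified illustration consistent with the reconstructed formulas (7)-(11); it treats the scale features as an S x d matrix (one row per scale) and leaves the patch sampling and feature extraction to the caller.

```python
# Simplified sketch of the scale correlation filter of formulas (7)-(11).
import numpy as np

def scale_factors(S, theta):
    """Formula (7): scale factors theta**n for n in {-(S-1)//2, ..., (S-1)//2}."""
    n = np.arange(S) - (S - 1) // 2
    return theta ** n

def ideal_scale_output(S, sigma=1.0):
    """FFT of a 1-D Gaussian centred on the current scale, used as the ideal output g."""
    n = np.arange(S) - (S - 1) // 2
    return np.fft.fft(np.exp(-0.5 * (n / sigma) ** 2))

def new_scale_terms(F, G_hat):
    """Per-frame numerator/denominator terms of formula (9); F is S x d (rows index the scales)."""
    F_hat = np.fft.fft(F, axis=0)
    D = np.conj(G_hat)[:, None] * F_hat               # conj(G) * F^l, one column per feature dim
    E = np.sum(np.conj(F_hat) * F_hat, axis=1).real   # sum_k conj(F^k) * F^k
    return D, E

def update_scale_model(D, E, F, G_hat, eta):
    """Formulas (10a)/(10b): linear interpolation of numerator D and denominator E."""
    D_new, E_new = new_scale_terms(F, G_hat)
    return (1 - eta) * D + eta * D_new, (1 - eta) * E + eta * E_new

def scale_response(D, E, Z, lam):
    """Formula (11): response over the S scales; the argmax gives the estimated scale."""
    Z_hat = np.fft.fft(Z, axis=0)
    num = np.sum(np.conj(D) * Z_hat, axis=1)
    return np.real(np.fft.ifft(num / (E + lam)))
```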
Step 40, when the fusion response value obtained after the response value fusion in the step 20 is less than or equal to the set threshold, re-detecting the target image to obtain a candidate region, and returning to the step 20; when the fusion response value is larger than a set threshold value, updating the position model and the scale model, and then entering step 50; the method specifically comprises the following steps:
step 401, judging whether the fused output response value R_t obtained in step 20 is less than or equal to the fixed threshold Δ; if it is, executing step 402, and if not, executing step 403; only when the fused response value calculated by formula (6) is less than or equal to the fixed threshold does this indicate that the tracking effect is poor or tracking has failed, so that re-detection is needed;
step 402, generating a number of candidate regions C_d over the whole image through the EdgeBox re-detector, then calculating the confidence value of each candidate region, and learning the optimized context-aware filter using a learning rate in order to maintain a long-term memory of appearance changes; obtaining the maximum response value g(c), then selecting the optimal candidate region as the re-detection result by minimizing formula (12), and then returning to step 20;
[Formula (12), shown only as an image in the original: the candidate score combining the maximum response value g(c) of each candidate $c_{t}^{i}$ with the weighted center-distance term α·D; the candidate minimizing this score is selected]
wherein g(c) denotes the maximum response value, α denotes the weight factor, $c_{t}^{i}$ denotes a candidate region in frame t, and D denotes the distance between the center of each $c_{t}^{i}$ and the center of the bounding box $c_{t-1}$;
step 403, updating the position model by using formulas (3a) and (3b), updating the scale model by using formulas (10a) and (10b), and then proceeding to step 50.
EdgeBox is introduced as a re-detector in the tracking process to handle tracking failures and improve tracking robustness.
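The exact form of formula (12) is shown only as an image in the original, so the sketch below assumes a simple score that rewards a high filter response g(c) and penalizes, with weight α, the distance D between a candidate's center and the previous bounding box center; this matches the quantities named in the text but is an assumption, not the patent's formula. The proposal generation and filter evaluation are left to the caller, since EdgeBox and the learned filter are external components.

```python
# Sketch of step 40: threshold test and candidate selection for re-detection.
# The combined score is an assumption consistent with the quantities named for formula (12).
import numpy as np

def centre_distance(box_a, box_b):
    """Distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    return float(np.hypot(ax - bx, ay - by))

def redetect_or_update(fused_peak, delta, candidates, responses, prev_box, alpha):
    """candidates: list of (x, y, w, h) proposals; responses: their maximum filter responses g(c)."""
    if fused_peak > delta:
        return "update_models", prev_box   # step 403: tracking is reliable, keep the current box
    # step 402: prefer candidates with high response and small center displacement
    scores = [alpha * centre_distance(c, prev_box) - g for c, g in zip(candidates, responses)]
    best = int(np.argmin(scores))
    return "redetected", candidates[best]
```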
And step 50, using the updated position model and the updated scale model for tracking the next frame, and returning to the step 40.
Please refer to fig. 3 to 14, which are plots automatically generated by MATLAB software. Figs. 3 to 14 compare, in various ways, the tracking accuracy and tracking success rate of the method of the embodiments of the present specification (the proposed method) with other target tracking methods or algorithms, including CNN-SVM, Staple_CA, C-COT_HOG, SAMF_AT, Staple, CFNet_conv3, SRDCF, LMCF, SiamFC, SAMF_CA, LCT, DSST and KCF. The entries in the legend boxes on the right of figs. 3 to 14 list the methods (or algorithms) from best to worst, top to bottom. As can be seen from figs. 3 to 14, the method of the embodiment of the present specification ranks essentially first on the 100 video sequences of the OTB-2015 data set and has a clear advantage in tracking accuracy and tracking success rate compared with the other methods.
The meaning of the accuracy plots in figs. 3 to 14 is as follows: in the tracking accuracy evaluation, a widely used criterion is the center location error, defined as the average Euclidean distance between the center position of the tracked target and the manually annotated ground-truth position. The accuracy plot shows the percentage of frames, out of the total number of frames, whose estimated position lies within a given distance threshold of the ground truth.
The meaning of the success rate plots in figs. 3 to 14 is as follows: in the success rate evaluation, the criterion is the overlap ratio of bounding boxes. Let the tracked bounding box be γ_t and the ground-truth bounding box be γ_a; the overlap ratio is defined as $S = \lvert \gamma_t \cap \gamma_a \rvert / \lvert \gamma_t \cup \gamma_a \rvert$, where ∩ and ∪ denote the intersection and union of the two regions, respectively, and |·| denotes the number of pixels in a region. To gauge the performance of the algorithm over a series of frames, we count the number of successful frames whose overlap ratio S is greater than a given threshold t_0. The success rate plot gives the proportion of successful frames as the threshold varies from 0 to 1.
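Both evaluation criteria described above can be computed directly from the tracked and ground-truth boxes; a short numpy sketch, with boxes given as (x, y, w, h), is shown below.

```python
# Sketch of the OTB-style precision and success metrics described above.
import numpy as np

def centre_error(box_t, box_a):
    cx_t, cy_t = box_t[0] + box_t[2] / 2, box_t[1] + box_t[3] / 2
    cx_a, cy_a = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    return float(np.hypot(cx_t - cx_a, cy_t - cy_a))

def overlap_ratio(box_t, box_a):
    """S = |intersection| / |union| for axis-aligned boxes."""
    x1, y1 = max(box_t[0], box_a[0]), max(box_t[1], box_a[1])
    x2 = min(box_t[0] + box_t[2], box_a[0] + box_a[2])
    y2 = min(box_t[1] + box_t[3], box_a[1] + box_a[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_t[2] * box_t[3] + box_a[2] * box_a[3] - inter
    return inter / union if union > 0 else 0.0

def precision_at(tracked, truth, dist_threshold=20.0):
    errs = np.array([centre_error(t, a) for t, a in zip(tracked, truth)])
    return float(np.mean(errs <= dist_threshold))

def success_at(tracked, truth, overlap_threshold=0.5):
    ious = np.array([overlap_ratio(t, a) for t, a in zip(tracked, truth)])
    return float(np.mean(ious > overlap_threshold))
```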
In the embodiment of the present specification, a convolutional neural network is used to extract image features, the context-aware correlation filters output the corresponding response values, and the three response values are adaptively fused, so that the position model can predict the position of the target object well; a scale correlation filter is used for fast scale estimation of the target, which improves the ability to handle scale changes to a certain extent and improves the tracking accuracy; and an EdgeBox re-detector is incorporated, so that when tracking fails or drifts the re-detector re-detects the target image, and the position model and the scale model are updated only when the conditions are met.
Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims (2)

1. A target tracking method through hierarchical feature response fusion, comprising:
step 10, initializing parameters;
step 20, extracting the layered features of the target image, and performing response value fusion to obtain a position model, wherein the step comprises the following steps:
step 201, at frame t of the video sequence, taking a region of set size centered on the target point P(x_t, y_t) as the target sample, feeding the target sample into a convolutional neural network, and extracting the layer-3, layer-4 and layer-5 convolution features to obtain the layer-3, layer-4 and layer-5 feature images;
step 202, learning on layers 3, 4 and 5 of the convolutional neural network respectively to obtain three initial context-aware correlation filters;
circularly shifting the layer-3, layer-4 and layer-5 feature images to form training samples, obtaining a data matrix and a desired output, and using them to optimize the initial context-aware filter w' as in formula (1) to obtain the optimized context-aware filter:
[Formula (1), shown only as an image in the original: the regularized least-squares objective, with regularization weight λ1, that refines the initial filter w' into the optimized filter w from the data matrix U_0 and the desired output y]
wherein w is the optimized context-aware filter, w' is the initial context-aware filter, λ1 is the regularization term weight factor, U_0 is the data matrix, and y is the desired output;
step 203, convolving the three optimized context-aware filters with the layer-3, layer-4 and layer-5 feature images, obtaining the response vector of each feature image by using formula (2), and then searching for the position corresponding to the maximum of the response vector, namely the predicted position of the tracking target;
$f(z) = \mathcal{F}^{-1}\left(\hat{w} \odot \hat{z}\right)$   (2)
wherein z is the feature image, w is the optimized context-aware filter, a hat denotes the Fourier transform, $\mathcal{F}^{-1}$ is the inverse Fourier transform, ⊙ is the element-wise (dot) product between matrix elements, and f(z) is the response vector of the feature image;
step 204, updating the position parameters by a linear interpolation method so as to update the position model, the position parameter updates being given by formulas (3a) and (3b):
$\hat{\alpha}^{\,i} = (1-\eta)\,\hat{\alpha}^{\,i-1} + \eta\,\hat{\alpha}$   (3a)
$\hat{x}^{\,i} = (1-\eta)\,\hat{x}^{\,i-1} + \eta\,\hat{x}$   (3b)
wherein i is the sequence number of the current frame, η is the tracking model learning factor, $\hat{\alpha}$ is the closed-form solution of the training sample parameters in the Fourier domain obtained using the properties of the circulant matrix, and $\hat{x}$ is the updated position model of the target sample;
step 205, recording the three output response values obtained by convolving the layer-3, layer-4 and layer-5 feature images with their optimized context-aware filters as R_context3, R_context4 and R_context5, and then normalizing the response value weight of each layer at frame t as in formulas (4a), (4b), (4c):
$\mathrm{context3\_w}_{t} = \dfrac{R_{\mathrm{context3}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4a)
$\mathrm{context4\_w}_{t} = \dfrac{R_{\mathrm{context4}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4b)
$\mathrm{context5\_w}_{t} = \dfrac{R_{\mathrm{context5}}}{R_{\mathrm{context3}} + R_{\mathrm{context4}} + R_{\mathrm{context5}}}$   (4c)
updating the initial response value weights by using the frame-t response value weights as in formulas (5a), (5b), (5c):
$\mathrm{context3\_w}_{t} = (1-\tau)\,\mathrm{context3\_w}_{t-1} + \tau\,\mathrm{context3\_w}_{t}$   (5a)
$\mathrm{context4\_w}_{t} = (1-\tau)\,\mathrm{context4\_w}_{t-1} + \tau\,\mathrm{context4\_w}_{t}$   (5b)
$\mathrm{context5\_w}_{t} = (1-\tau)\,\mathrm{context5\_w}_{t-1} + \tau\,\mathrm{context5\_w}_{t}$   (5c)
wherein τ is the weight update parameter, and context3_w_t, context4_w_t and context5_w_t denote the response value weights at frame t;
step 206, at frame t, fusing the response values of the layer-3, layer-4 and layer-5 feature images through formula (6) to obtain the fused output response value R_t, obtaining the final position model, and obtaining the position of the tracking target according to the final position model;
$R_{t} = \mathrm{context3\_w}_{t} \cdot R_{\mathrm{context3}} + \mathrm{context4\_w}_{t} \cdot R_{\mathrm{context4}} + \mathrm{context5\_w}_{t} \cdot R_{\mathrm{context5}}$   (6)
wherein context3_w_t, context4_w_t and context5_w_t denote the response value weights at frame t, and R_context3, R_context4 and R_context5 are the output response values obtained by convolving the layer-3, layer-4 and layer-5 feature images with their optimized context-aware filters;
step 30, training the maximum scale response value of the scale correlation filter to obtain a scale model, including:
step 301, setting the sizes of the image patches extracted for scale evaluation around the target to
$\theta^{n} P \times \theta^{n} R, \quad n \in \left\{ -\left\lfloor \tfrac{S-1}{2} \right\rfloor, \dots, \left\lfloor \tfrac{S-1}{2} \right\rfloor \right\}$   (7)
wherein P and R are the width and height of the target sample in the previous frame, θ is the scale increment factor, and S is the number of scale levels;
step 302, minimizing the cost function in formula (8) to obtain the scale correlation filter:
$\varepsilon = \left\| g - \sum_{l=1}^{d} h^{l} \star f^{l} \right\|^{2} + \lambda \sum_{l=1}^{d} \left\| h^{l} \right\|^{2}$   (8)
wherein ε is the cost function, h is the scale correlation filter, g is the ideal correlation output, l denotes the feature dimension, λ is the regularization term weight factor, f is the response vector of the feature image, and d is the number of feature dimensions;
step 303, solving formula (8) in the frequency domain, as in formula (9), for estimating the target scale:
$H^{l} = \dfrac{\bar{G}\,F^{l}}{\sum_{k=1}^{d} \bar{F}^{k} F^{k} + \lambda}$   (9)
wherein H is the scale correlation filter in the frequency domain, l denotes the feature dimension, H^l is the scale correlation filter in dimension l, F^k is the k-th training sample, F^l is the training sample of the l-th dimension, G is the ideal correlation output, $\bar{G}$ is the complex conjugate of the ideal correlation output, $\bar{F}^{k}$ is the complex conjugate of the k-th training sample, λ is the regularization term weight factor, t is the frame number, and d and k index the feature dimensions;
step 304, to obtain a robust result, updating the numerator and the denominator of H^l in formula (9) separately so as to update the scale model:
$D_{t}^{l} = (1-\eta)\,D_{t-1}^{l} + \eta\,\bar{G}_{t} F_{t}^{l}$   (10a)
$E_{t} = (1-\eta)\,E_{t-1} + \eta \sum_{k=1}^{d} \bar{F}_{t}^{k} F_{t}^{k}$   (10b)
wherein η is the tracking model learning factor, $F_{t}^{k}$ is the k-th training sample, $\bar{F}_{t}^{k}$ is the complex conjugate of the k-th training sample, $G_{t}$ is the ideal correlation output, $F_{t}^{l}$ is the training sample of the l-th dimension, λ is the regularization term weight factor, t is the frame number, l is the dimension, and d and k index the feature dimensions; updating the scale model means updating $D_{t}^{l}$ and $E_{t}$;
step 305, in the next frame, the response value of the scale correlation filter can be determined by solving formula (11):
$y = \mathcal{F}^{-1}\left\{ \dfrac{\sum_{l=1}^{d} \bar{D}_{t}^{l} Z^{l}}{E_{t} + \lambda} \right\}$   (11)
wherein Z is the set of feature images z;
step 40, when the fusion response value obtained after the response value fusion in the step 20 is less than or equal to the set threshold, re-detecting the target image to obtain a candidate region, and returning to the step 20; when the fusion response value is larger than a set threshold value, updating the position model and the scale model, and then entering step 50; the method comprises the following steps:
step 401, judging whether the fused output response value R_t obtained in step 20 is less than or equal to the fixed threshold Δ; if it is, executing step 402, and if not, executing step 403;
step 402, generating a number of candidate regions C_d over the whole image through the EdgeBox re-detector, then calculating the confidence value of each candidate region, learning the optimized context-aware filter using a learning rate, obtaining the maximum response value g(c), then selecting the optimal candidate region as the re-detection result by minimizing formula (12), and then returning to step 20,
[Formula (12), shown only as an image in the original: the candidate score combining the maximum response value g(c) of each candidate $c_{t}^{i}$ with the weighted center-distance term α·D; the candidate minimizing this score is selected]
wherein g(c) denotes the maximum response value, α denotes the weight factor, $c_{t}^{i}$ denotes a candidate region in frame t, and D denotes the distance between the center of each $c_{t}^{i}$ and the center of the bounding box $c_{t-1}$;
step 403, updating the position model by using formulas (3a) and (3b), updating the scale model by using formulas (10a) and (10b), and then proceeding to step 50;
and step 50, using the updated position model and the updated scale model for tracking the next frame, and returning to the step 40.
2. The target tracking method through hierarchical feature response fusion according to claim 1, wherein in step 10 the parameters include: the hierarchical correlation filters {W_t^l | l = 3, 4, 5}, the filter regularization term weight factors λ and λ1, the tracking model learning factor η, the number of scale levels S, the scale increment factor θ, the weight update parameter τ, the fixed threshold Δ and the weight factor α.
CN201911250349.XA 2019-12-09 2019-12-09 Target tracking method through hierarchical feature response fusion Active CN111008996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911250349.XA CN111008996B (en) 2019-12-09 2019-12-09 Target tracking method through hierarchical feature response fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911250349.XA CN111008996B (en) 2019-12-09 2019-12-09 Target tracking method through hierarchical feature response fusion

Publications (2)

Publication Number Publication Date
CN111008996A CN111008996A (en) 2020-04-14
CN111008996B true CN111008996B (en) 2023-04-07

Family

ID=70115126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911250349.XA Active CN111008996B (en) 2019-12-09 2019-12-09 Target tracking method through hierarchical feature response fusion

Country Status (1)

Country Link
CN (1) CN111008996B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111640138B (en) * 2020-05-28 2023-10-27 济南博观智能科技有限公司 Target tracking method, device, equipment and storage medium
CN111612001B (en) * 2020-05-28 2023-04-07 华侨大学 Target tracking and positioning method based on feature fusion
CN111968156A (en) * 2020-07-28 2020-11-20 国网福建省电力有限公司 Adaptive hyper-feature fusion visual tracking method
CN113537001B (en) * 2021-07-02 2023-06-23 安阳工学院 Vehicle driving autonomous decision-making method and device based on visual target tracking
CN113610891B (en) * 2021-07-14 2023-05-23 桂林电子科技大学 Target tracking method, device, storage medium and computer equipment
CN113537241B (en) * 2021-07-16 2022-11-08 重庆邮电大学 Long-term correlation filtering target tracking method based on adaptive feature fusion

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015163830A1 (en) * 2014-04-22 2015-10-29 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi Target localization and size estimation via multiple model learning in visual tracking
CN108549839A (en) * 2018-03-13 2018-09-18 华侨大学 The multiple dimensioned correlation filtering visual tracking method of self-adaptive features fusion
CN109325966A (en) * 2018-09-05 2019-02-12 华侨大学 A method of vision tracking is carried out by space-time context
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
常敏 et al. Correlation filter tracking with adaptive feature fusion and model update. Acta Optica Sinica (光学学报), 2019, full text. *
陈智 et al. Multi-scale correlation filter target tracking algorithm with adaptive feature fusion. Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报), 2018, full text. *

Also Published As

Publication number Publication date
CN111008996A (en) 2020-04-14

Similar Documents

Publication Publication Date Title
CN111008996B (en) Target tracking method through hierarchical feature response fusion
CN111354017B (en) Target tracking method based on twin neural network and parallel attention module
CN108549839B (en) Adaptive feature fusion multi-scale correlation filtering visual tracking method
CN108830285B (en) Target detection method for reinforcement learning based on fast-RCNN
CN109741366B (en) Related filtering target tracking method fusing multilayer convolution characteristics
CN107424177B (en) Positioning correction long-range tracking method based on continuous correlation filter
Kwon et al. Highly nonrigid object tracking via patch-based dynamic appearance modeling
CN110135500B (en) Target tracking method under multiple scenes based on self-adaptive depth characteristic filter
CN110472594B (en) Target tracking method, information insertion method and equipment
CN111260738A (en) Multi-scale target tracking method based on relevant filtering and self-adaptive feature fusion
CN111260688A (en) Twin double-path target tracking method
CN111310582A (en) Turbulence degradation image semantic segmentation method based on boundary perception and counterstudy
CN109035300B (en) Target tracking method based on depth feature and average peak correlation energy
CN110276784B (en) Correlation filtering moving target tracking method based on memory mechanism and convolution characteristics
CN110909591B (en) Self-adaptive non-maximum suppression processing method for pedestrian image detection by using coding vector
CN109325966B (en) Method for carrying out visual tracking through space-time context
CN109308713B (en) Improved nuclear correlation filtering underwater target tracking method based on forward-looking sonar
CN109903315B (en) Method, apparatus, device and readable storage medium for optical flow prediction
CN110660080A (en) Multi-scale target tracking method based on learning rate adjustment and fusion of multilayer convolution features
CN110992288B (en) Video image blind denoising method used in mine shaft environment
CN110992401A (en) Target tracking method and device, computer equipment and storage medium
CN111583311A (en) PCBA rapid image matching method
CN111340842A (en) Correlation filtering target tracking algorithm based on joint model
CN113052873A (en) Single-target tracking method for on-line self-supervision learning scene adaptation
CN107657627B (en) Space-time context target tracking method based on human brain memory mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant