CN113538509A - Visual tracking method and device based on adaptive correlation filtering feature fusion learning - Google Patents

Visual tracking method and device based on adaptive correlation filtering feature fusion learning

Info

Publication number
CN113538509A
Authority
CN
China
Prior art keywords
feature
response
adaptive
fusion
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110615306.8A
Other languages
Chinese (zh)
Other versions
CN113538509B (en)
Inventor
朱鹏飞 (Zhu Pengfei)
于洪涛 (Yu Hongtao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110615306.8A priority Critical patent/CN113538509B/en
Publication of CN113538509A publication Critical patent/CN113538509A/en
Application granted granted Critical
Publication of CN113538509B publication Critical patent/CN113538509B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion
    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20081: Training; Learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20212: Image combination
    • G06T2207/20221: Image fusion; Image merging
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/30: Subject of image; Context of image processing
    • G06T2207/30241: Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual tracking method and device based on adaptive correlation-filtering feature fusion learning. In the method, two feature extractors convert the input tracking image into shallow and deep feature representations; a correlation filtering module learns and updates the correlation filter coefficients with the aid of an adaptive temporal smoothing penalty, and the target feature maps are then correlated with the updated filters to obtain response maps of the shallow and deep features. The feature fusion weights output by the adaptive feature response fusion learning module at the previous time step are used to weight and fuse the shallow and deep response maps into a final target feature response map, while the fusion weights output at the current time step are used to fuse the response maps of the next frame. Based on the final target feature response map, the scale of the current target is obtained by multi-scale search, realizing visual tracking. The device comprises a processor and a memory. The invention effectively addresses challenging problems in the tracking process such as occlusion, illumination change and complex backgrounds.

Description

Visual tracking method and device based on adaptive correlation filtering feature fusion learning
Technical Field
The invention relates to the field of visual tracking, in particular to a robust visual tracking method and device based on adaptive correlation filtering feature fusion learning.
Background
Visual target tracking is a challenging task in the field of computer vision with many practical applications, for example traffic monitoring, human-computer interaction, autonomous robots and autonomous driving. Although existing tracking methods have improved markedly in accuracy and robustness, problems such as illumination change, target occlusion, target deformation and background clutter are still not well solved.
In recent years, tracking models based on Discriminative Correlation Filters (DCFs) have received increasing attention due to their high accuracy and efficiency. Model updating based on multi-feature fusion and temporal constraints has significantly improved the precision and robustness of DCF-based trackers. Despite this progress, existing DCF-based trackers still have two limitations: non-robust feature representation and poor adaptability to diverse scenes.
The feature response map is the result of the correlation operation between the target feature map and the discriminative filter, and its quality strongly affects target localization accuracy. A rich feature response provides an accurate and robust representation of the object for visual tracking. Currently, a common approach to feature response map fusion is to integrate multiple response maps with fusion weights of equal size. This coarse combination of multi-feature responses improves the quality of the target representation to some extent, but it also undermines the original advantages of each response map. For example, the response of the hand-crafted Histogram of Oriented Gradients (HOG) feature contains a sharp target peak that can be used for accurate localization, but it is often disturbed by the responses of similar objects. In contrast, response maps produced by Convolutional Neural Networks (CNNs) carry high-confidence target semantics, yet they are only suitable for coarse spatial localization of the target. Naively adding the two weakens both the discrimination and the precise localization expressed by the original response maps. The Staple model (a tracking method based on fusing the response maps of a color histogram and HOG features) selects a feature fusion weight through comparative experiments and then combines the HOG feature and the color histogram with this weight, exploiting the discriminative power of the HOG response and the robustness of the color histogram to target deformation. However, a weight selected through extensive comparison experiments on a validation set is difficult to generalize to other scenes with different tracking challenge attributes. The MCCT (a fusion tracking method combining response-map self-evaluation with an expert strategy) and UPDT (a fusion tracking method for shallow and deep features based on response-map confidence evaluation) models instead solve for the fusion weights of shallow and deep feature responses using a self-evaluation value of the current target feature response. Such self-evaluation-based fusion avoids the extra cost of model fine-tuning, and its tracking accuracy is superior to fixed fusion weights. However, the evaluation of the fused response is strongly influenced by the current tracking quality; as tracking errors accumulate, tracking drift easily occurs.
In addition, temporal constraints can maintain reliable model updates and thereby ensure robust target feature responses. Temporal smoothing can overcome the model corruption caused by unreliable target representations; in particular, a temporal smoothing penalty influences the direction and speed of model updating so as to resist short-term challenges such as target occlusion or appearance change. The STRCF [1] tracking model (spatial-temporal regularized correlation filter tracking method) reduces the model corruption caused by target occlusion by introducing temporal regularization into a DCF-based tracker. Considering the interference of background noise, the ARCF model (correlation filter tracking method based on temporal distortion suppression) imposes a consistency constraint on the feature response maps of adjacent frames during filter learning, so that the filter coefficients are updated and learned from target information rather than background information. In addition, the LADCF model (correlation filter tracking method based on temporal consistency constraints) achieves temporal consistency by restricting the filter update to remain close to its historical value, so that the obtained target representation is more robust to temporal appearance changes of the scene or target.
Although the above methods achieve some improvement in tracking performance, a fixed temporal smoothing penalty ignores the variability of tracking situations and is applicable only to a limited set of tracking scenarios. Furthermore, the video sequences used for tracking typically contain a wide variety of scene changes, such as illumination changes and occlusion.
To this end, two challenges currently facing correlation-filtering-based tracking models are summarized: how to form a robust target response, and how to maintain reliable model updates.
Disclosure of Invention
The invention provides a robust visual tracking method and device based on adaptive correlation-filtering feature fusion learning, which effectively address challenging problems in the tracking process such as occlusion, illumination change and complex backgrounds, as described in detail below:
in a first aspect, a robust visual tracking method based on adaptive correlation filtering feature fusion learning includes the following steps:
the two feature extractors extract the input tracking image into shallow and deep feature representations, the correlation filtering module completes the learning and updating of the correlation filter coefficients with the aid of an adaptive temporal smoothing penalty, and the target feature maps are then correlated with the updated correlation filters to obtain response maps of the shallow and deep features;
the feature fusion weights output by the adaptive feature response fusion learning module at the previous time step are weighted and fused with the response maps of the shallow and deep features to form the final target feature response map, and the feature fusion weights output at the current time step are used to fuse the feature response maps of the next frame;
and based on the final target feature response map, the scale of the current target is obtained by multi-scale search, realizing visual tracking.
In one embodiment, the adaptive temporal smoothing correlation filtering module uses the overall quality difference and the confidence difference of the target feature response maps between adjacent video frames as the basis for obtaining the degree of variation of the current target and scene;
the obtained degree of variation of the target and scene is then used to design a temporal smoothing penalty whose strength is adjusted adaptively, which is used for reliable updating of the adaptive correlation filtering model.
In an embodiment, the temporal smoothing penalty with adaptively adjusted strength is specifically:
[formula rendered as an image in the original publication]
where the first term of the function e(.), |R_{max,t-1} - R_max|, represents the maximum-confidence gap between adjacent frames; ||R_{t-1} - R||_2 represents the difference in overall response quality between adjacent frames; γ is a hyperparameter; R = β_t ⊙ R^D + (I - β_t) ⊙ R^S denotes the feature response integrated by the adaptive fusion weight β_t, which is learned by the adaptive feature response fusion learning module; R_{t-1} denotes the fused response of the previous frame; and R_{max,t-1} and R_max denote the maximum confidence of the previous and current frames, respectively.
The adaptive feature response fusion learning module introduces Histogram of Oriented Gradients and convolutional neural network features to integrate the feature responses; the desired correlation filters are solved by two independent correlation filter training processes.
In one embodiment, the adaptive feature response fusion learning module learns the optimal weights for fusing the histogram of oriented gradients and convolutional neural network feature response maps by a ridge regression method supervised with the Gaussian response ground truth.
The optimal weight is expressed as:
[formula rendered as an image in the original publication]
where β_t denotes the optimal feature response fusion matrix; β̃_t is a reference value for β_t that provides prior information and avoids model degradation; and μ is a hyperparameter.
In a second aspect, a robust visual tracking apparatus based on adaptive correlation filter feature fusion learning, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor calling the program instructions stored in the memory to cause the apparatus to perform the method steps of any of the first aspects.
In a third aspect, a computer-readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any one of the first aspect.
The technical scheme provided by the invention has the beneficial effects that:
1. The method fuses deep and hand-crafted features in a supervised learning manner and effectively exploits the complementary characteristics of the two kinds of features; compared with existing methods that fuse multiple features with fixed weights or a response-map self-evaluation strategy, it is more intelligent and effective and can integrate a higher-quality feature response map for target localization;
2. Addressing the inherent temporal nature of visual target tracking, the invention proposes, on top of the existing fixed-strength temporal smoothing, a temporal smoothing constraint that is adjusted adaptively according to real-time changes in tracking quality; this constraint method has better adaptability and generalization;
3. Extensive experiments on five challenging visual tracking benchmarks demonstrate the superiority of the invention over other advanced methods; for example, compared with the recent tracking methods ECO (tracking method based on efficient factorized convolution) and STRCF [1] (spatial-temporal regularized correlation filter tracking method), AFLCF obtains AUC (target overlap precision) gains of 1.9% and 4.4% respectively on the LaSOT benchmark; in addition, visualization experiments and tracking-scene attribute experiments verify that the method can effectively cope with challenging problems such as target deformation, occlusion, illumination variation, camera motion, fast target motion and background clutter;
4. The invention is updated online, requires no massive labeled training data, has a small model size and low hardware requirements, and can be applied to mobile devices such as unmanned aerial vehicles, unmanned vehicles and autonomous robots.
Drawings
FIG. 1 is a flow chart of a robust visual tracking method based on adaptive correlation filter feature fusion learning;
FIG. 2 is a schematic diagram of a video sequence with scene and target changes, used for evaluating the IOU (intersection-over-union, i.e., target overlap ratio) score of each video frame under temporal smoothing penalties of different strengths;
When the scene or the target (the rider's head in the figure) does not change significantly, a stronger temporal smoothing penalty yields a higher tracking IOU score; for example, in the two pictures of the first column (the adjacent 40th and 41st frames of the video sequence) the target moves smoothly and the scene shows no obvious change. When the target or scene changes drastically, a weaker temporal smoothing penalty achieves a higher tracking IOU score; for example, in the second column (the 69th and 70th frames of the video sequence) the target undergoes obvious fast motion and severe motion blur. The figure illustrates that a larger temporal smoothing penalty is suitable only when the target and scene show no obvious variation, whereas a smaller penalty achieves higher tracking performance when the target or background varies severely. Therefore, a temporal smoothing penalty that adapts to the changes of the tracking scene and target is designed, which can effectively improve the tracking performance of the model.
FIG. 3 is a comparison of the performance of the method with 18 advanced tracking methods on the OTB-2015 benchmark;
The left plot compares the distance precision of the tracking methods, where the value after each method name is the distance precision it achieved; the right plot compares the overlap success rate, where the value after each method name is the overlap success rate it achieved. The experiments show that AFLCF, the tracking method corresponding to the invention, achieves leading tracking performance.
FIG. 4 is a diagram comparing the performance of the method with nine advanced tracking methods in a variety of scenes with different tracking challenges;
The value after each tracking method name indicates the overlap success rate it achieved. The challenges mainly include illumination variation, target deformation, occlusion, camera motion, fast target motion and background clutter. The experimental results show that the method is clearly superior to existing tracking methods and is more robust to the various tracking challenges.
Fig. 5 is a schematic structural diagram of a robust visual tracking device based on adaptive correlation filtering feature fusion learning.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
To address the above problems, an embodiment of the invention provides a tracking model based on adaptive feature response fusion learning for DCFs. The model uses adaptive supervised learning, instead of manual setting or self-evaluation, to solve for the weights used to fuse multiple features, and then integrates the response maps of the shallow HOG feature and the deep CNN feature with the learned weights. The solved optimal fusion weights can fully exploit the complementary advantages of the multi-feature responses, thereby yielding a robust target representation. Meanwhile, the design of the adaptive temporal smoothing penalty ensures that the model can still be updated reliably in scenes with motion changes.
Fig. 1 shows the overall structure of the tracking method provided by an embodiment of the invention. The embodiment uses the histogram of oriented gradients and a convolutional neural network as extractors of shallow and deep features; the model is updated and the filter coefficients are solved with a correlation filtering method equipped with an adaptive temporal smoothing penalty; and the adaptive feature response fusion learning module learns, in the current frame, the optimal fusion weights for the response maps of the HOG feature and the CNN feature. The solved fusion weights are then used to integrate the multi-feature response maps in the next frame. The fixed-strength temporally smoothed correlation filter tracking method STRCF [1] serves as the baseline model of the embodiment of the invention.
The objective function of STRCF [1] is shown in equation (1); for simplicity, the index t of the current time is uniformly omitted (e.g., x_t is written as x),
argmin_w (1/2) || Σ_{k=1}^{K} x^k * w^k − y ||^2 + (1/2) Σ_{k=1}^{K} || M ⊙ w^k ||^2 + (λ/2) || w − w_{t−1} ||^2    (1)
where x = [x^1, ..., x^k, ..., x^K] denotes the input feature map consisting of K channels, and x^k denotes the feature of the k-th channel in the input feature map. w denotes the correlation filter to be learned, and * denotes the convolution operator. y is the Gaussian ground-truth target response (a default ground truth in machine learning, not detailed further in the embodiment of the invention). The second term is the spatial regularization borrowed from SRDCF (correlation filter tracking model based on spatial regularization), where M is a Gaussian spatial weight and k is the index of the k-th channel of the correlation filter. ⊙ denotes element-wise multiplication. ||w − w_{t−1}||^2 is the temporal smoothing constraint used to prevent model corruption, λ is the penalty coefficient, and w_{t−1} is the correlation filter at the previous time step.
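For concreteness, the sketch below evaluates the three terms of equation (1) with numpy; the (K, H, W) array layout, the FFT-based circular cross-correlation, and all variable names are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def strcf_objective(x, w, y, M, w_prev, lam):
    """Evaluate the three terms of an STRCF-style objective (illustrative sketch).

    x, w, w_prev : (K, H, W) feature map, current filter, previous filter
    y            : (H, W) Gaussian ground-truth response
    M            : (H, W) Gaussian spatial weight
    lam          : temporal smoothing penalty coefficient
    """
    # Data term: per-channel circular cross-correlation (via the FFT), summed over
    # channels and compared with the Gaussian ground-truth response y.
    X = np.fft.fft2(x, axes=(-2, -1))
    W = np.fft.fft2(w, axes=(-2, -1))
    response = np.real(np.fft.ifft2(np.sum(X * np.conj(W), axis=0)))
    data_term = 0.5 * np.sum((response - y) ** 2)

    # Spatial regularization: the Gaussian mask M suppresses filter energy away from the target.
    spatial_term = 0.5 * np.sum((M[None] * w) ** 2)

    # Temporal smoothing: penalize deviation from the previous filter with strength lam.
    temporal_term = 0.5 * lam * np.sum((w - w_prev) ** 2)

    return data_term + spatial_term + temporal_term
```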
First, adaptive feature response fusion learning
Robust object representations with explicit semantics and spatial details enable trackers to overcome various object changes. The tracker introduces Histogram of Oriented Gradients (HOG) and Convolutional Neural Network (CNN) features to integrate the feature responses. The CNN features have more channels than HOG, so two independent correlation filter training processes are required to solve for the desired correlation filters. The training samples are denoted x^D and x^S, and the corresponding filters to be solved are denoted w^D and w^S.
Specifically, x^D denotes the training sample expressed by CNN features, x^S the training sample expressed by HOG features, w^D the filter corresponding to x^D, and w^S the filter corresponding to x^S. The feature response map is written as the convolution of the sample x with the filter w, as shown in equation (2), where R^D and R^S denote the response maps of the CNN and HOG features, respectively.
To facilitate the subsequent fusion of the feature response maps, R^D and R^S are resized to the same size.
R^D = Σ_{k=1}^{K1} x_D^k * w_D^k,  R^S = Σ_{k=1}^{K2} x_S^k * w_S^k    (2)
where K1 is the number of channels of the CNN features, set to 512 in the embodiment of the invention, and K2 is the number of channels of the HOG feature, set to 31; in specific implementations the embodiment of the invention does not limit these values, which are set as needed in practical applications.
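To make the two-branch response computation of equation (2) concrete, the following sketch cross-correlates each feature map with its filter in the Fourier domain and resizes the HOG response to the size of the CNN response; the channel counts of 512 and 31 follow the text above, while the bilinear resizing and the helper names are assumptions made only for this illustration.

```python
import numpy as np
from scipy.ndimage import zoom

def response_map(x, w):
    """Sum of per-channel circular cross-correlations between features x and filter w (FFT-based)."""
    X = np.fft.fft2(x, axes=(-2, -1))
    W = np.fft.fft2(w, axes=(-2, -1))
    return np.real(np.fft.ifft2(np.sum(X * np.conj(W), axis=0)))

def shallow_and_deep_responses(x_D, w_D, x_S, w_S):
    """R^D from CNN features (e.g. 512 channels) and R^S from HOG features (e.g. 31 channels)."""
    R_D = response_map(x_D, w_D)
    R_S = response_map(x_S, w_S)
    # Resize R^S to R^D's spatial size so the two maps can later be fused element-wise.
    R_S = zoom(R_S, (R_D.shape[0] / R_S.shape[0], R_D.shape[1] / R_S.shape[1]), order=1)
    return R_D, R_S
```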
Innovatively, the AFLCF model (a tracking method based on adaptive correlation-filter feature response fusion learning) of the embodiment of the invention learns the optimal weights for fusing the HOG and CNN feature response maps by ridge regression supervised with the Gaussian response ground truth y. The corresponding objective function can be expressed as:
argmin_{β_t} || β_t ⊙ R^D + (I − β_t) ⊙ R^S − y ||^2 + μ || β_t − β̃_t ||^2    (3)
where β_t denotes the optimal feature response fusion matrix, and β̃_t is a reference value for β_t (its specific setting is given by a formula rendered as an image in the original publication) that provides prior information and avoids model degradation. I is the identity matrix and μ is a hyperparameter. The second term of the above objective function is a regularization term used to prevent overfitting of the model.
To improve computational efficiency, Parseval's theorem [2] can be used to transform equation (2) into the Fourier domain, giving equation (4) [formula rendered as an image in the original publication], where R̂^D, x̂^D and ŵ^D are the Fourier-domain representations of the CNN response map, the CNN feature and the corresponding filter, and R̂^S, x̂^S and ŵ^S are the Fourier-domain representations of the HOG response map, the HOG feature and the corresponding filter.
Accordingly, equation (3) is transformed into equation (5) [formula rendered as an image in the original publication], where ŷ is the representation of the Gaussian response ground truth y in the Fourier domain.
Finally, the objective function corresponding to equation (5) can be solved to obtain the analytic solution for β_t, given as equation (6) [formula rendered as an image in the original publication], where ^T denotes transposition and the division in equation (6) is performed element-wise.
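As an illustration of the ridge-regression idea behind equations (3) to (6), the sketch below solves for the per-element fusion weight directly in the spatial domain instead of the Fourier-domain formulation used by the patent; the closed form follows from differentiating the per-element objective, and the clipping step and the choice of reference weight are assumptions made only for this example.

```python
import numpy as np

def fuse_weight(R_D, R_S, y, beta_ref, mu=0.01):
    """Per-element ridge-regression fusion weight (spatial-domain sketch).

    Independently for every element, minimizes
        (beta * R_D + (1 - beta) * R_S - y)^2 + mu * (beta - beta_ref)^2,
    whose closed-form solution is written below.
    """
    d = R_D - R_S
    beta = ((y - R_S) * d + mu * beta_ref) / (d * d + mu)
    return np.clip(beta, 0.0, 1.0)  # keep the weights in a convex-combination range (assumption)
```

In this sketch the reference value beta_ref could, for instance, be the weight matrix learned in the previous frame, so that the regularizer keeps the fusion weights from drifting abruptly between frames.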
Second, adaptive temporal smoothing
Correlation-filter-based trackers use a temporal smoothing penalty to enforce similarity between the current filter and the previous one in order to prevent model corruption. Intuitively, when the target undergoes a sharp appearance change or occlusion, the required filter coefficients should not be too similar to the filter at the previous time step, so the strength of the temporal smoothing penalty should be reduced appropriately; otherwise, the opposite holds.
To verify this assumption, the embodiment of the invention numerically tests temporal smoothing penalties of different strengths. As shown in Fig. 2, the adjacent 40th and 41st frames of a video sequence experience no significant target or scene variation, and a larger penalty factor (λ = 20, 30) achieves a better IOU (intersection-over-union, i.e., target overlap ratio) score than a weaker temporal penalty (λ = 0, 1). In contrast, the 69th and 70th frames experience significant fast target motion and motion blur, so the penalty factors λ = 0 and 1 achieve better IOU scores than the larger settings (λ = 20, 30). Moreover, a large penalty also risks causing tracking drift in subsequent frames, because a large penalty coefficient forces adjacent filters to be too similar and the model cannot adapt to complex, changing scenes. Therefore, the optimal setting of λ should adapt to the changing tracking scenario. Based on this analysis and experimental verification, and inspired by the peak-to-sidelobe ratio (PSR) [3], the embodiment of the invention proposes an adaptive temporal smoothing penalty. The PSR is a measure of feature response quality that can evaluate whether the current target feature response is good.
[Equation (7), the adaptive temporal smoothing penalty; the formula appears as an image in the original publication.]
where R = β_t ⊙ R^D + (I − β_t) ⊙ R^S denotes the feature response integrated by the adaptive fusion weight β_t, which is learned by the adaptive feature response fusion learning module; R_{t−1} denotes the fused response of the previous frame; γ is a hyperparameter; and R_{max,t−1} and R_max denote the maximum confidence of the previous and current frames, respectively.
The first term of the function e(.), |R_{max,t−1} − R_max|, represents the maximum-confidence gap between adjacent frames, and ||R_{t−1} − R||_2 represents the difference in overall response quality between adjacent frames. In practice, the maximum confidence and the overall response quality measure the accuracy and the robustness of the tracker, respectively: the maximum confidence determines the localization of the target, while the overall response map carrying the target semantics affects the distinction between foreground and background. The change of the tracker's performance in adjacent frames can therefore be measured by the sum of these two terms, which also indirectly reflects the degree of scene change in the current frame. Accordingly, the embodiment of the invention concludes that this change measure becomes larger as the degree of tracking scene change increases, and λ becomes correspondingly smaller; otherwise the opposite holds. The adaptive temporal smoothing penalty thus adapts to changing tracking scenarios and maintains reliable model updates.
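The sketch below illustrates one way the response-quality differences described above could drive the penalty strength; the exact functional form of equation (7) is given only as an image in the original publication, so the exponential mapping used here is an assumption that merely matches the stated behaviour (the larger the change measure, the smaller λ) and reuses the hyperparameter γ.

```python
import numpy as np

def adaptive_lambda(R, R_prev, gamma=17.0):
    """Adapt the temporal smoothing penalty to the amount of target/scene change.

    R, R_prev : fused response maps of the current and previous frame
    Returns a penalty that shrinks as the change measure grows (assumed exponential form).
    """
    conf_gap = abs(R.max() - R_prev.max())        # |R_max,t-1 - R_max|
    quality_gap = np.linalg.norm(R - R_prev)      # ||R_t-1 - R||_2
    return gamma * np.exp(-(conf_gap + quality_gap))
```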
Third, tracking process description and implementation conditions
The workflow of the method is shown in Fig. 1. The adaptive correlation filtering tracking model consists of two feature extractors (a histogram of oriented gradients and a convolutional neural network), a correlation filtering model equipped with adaptive temporal smoothing, and an adaptive feature response fusion learning module. The feature extractors extract shallow and deep feature representations of the input tracking image. The correlation filtering model takes the spatial-temporally regularized STRCF [1] model as its base model; on top of STRCF's fixed temporal smoothing penalty, the embodiment of the invention introduces an adaptively adjusted temporal smoothing penalty, which indirectly evaluates the degree of abrupt change in the tracking scene from the current tracking quality of the model and adjusts the penalty strength accordingly. Specifically, when the scene or target changes significantly along the time axis, the corresponding temporal smoothing strength is reduced; conversely, when the scene change is mild, the temporal smoothing penalty strength should be increased moderately.
During tracking, the embodiment of the invention adaptively adjusts the temporal smoothing strength according to the difference in tracking quality between adjacent video frames, thereby ensuring reliable updates of the model and of the correlation filters. The tracking image input to the model is extracted into shallow and deep feature representations, and the target feature maps are correlated with the updated correlation filters to obtain the response maps R^S and R^D of the shallow and deep features. The feature fusion weight β_t learned by the adaptive feature response fusion learning module is used to fuse the feature response maps of the next frame, yielding a more robust target representation, while the response maps R^S and R^D of the current video frame are weighted and fused with the previously learned fusion weights to form the final target feature response. In the fused feature response map, the position of the maximum response value is the position of the target, and the scale of the current target is obtained by a multi-scale search method.
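A minimal per-frame sketch of the fusion-and-localization step described above is given below; feature extraction, filter updating and scale estimation are left out, and the function and variable names are placeholders rather than the patent's code.

```python
import numpy as np

def locate_target(R_D, R_S, beta_prev):
    """Fuse the deep and shallow response maps with last frame's weights and localize the peak.

    R_D, R_S  : (H, W) response maps of the CNN and HOG features (already the same size)
    beta_prev : (H, W) fusion weights learned in the previous frame
    """
    R = beta_prev * R_D + (1.0 - beta_prev) * R_S      # R = beta ⊙ R^D + (I - beta) ⊙ R^S
    row, col = np.unravel_index(np.argmax(R), R.shape)
    return (row, col), R                               # the peak position gives the target center
```

The fused map R would then be passed both to the fusion-weight solver, which produces β_t for the next frame, and to the adaptive penalty computation.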
In the embodiment of the invention, HOG is taken as the shallow feature and the Conv4 layer of the VGG-M model as the deep target feature representation. Target scale estimation is performed with a search over 7 scales with a scale step of 1.01. The number of ADMM iterations is set to 2. The hyperparameter μ in equation (3) is set to 0.01 and γ in equation (7) is set to 17. The AFLCF model was implemented in MATLAB 2017a with the MatConvNet toolbox. All simulation experiments were performed on a PC equipped with an Intel i7-8700 CPU, 16 GB RAM and an NVIDIA GTX 1050 Ti GPU.
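The following hedged sketch mirrors the 7-scale search with a 1.01 step quoted above; response_at_scale is a hypothetical callable standing in for patch extraction plus correlation at a given relative scale, and is not part of the original disclosure.

```python
import numpy as np

def multi_scale_search(response_at_scale, base_size, n_scales=7, step=1.01):
    """Evaluate the response at several relative scales and keep the best one.

    response_at_scale : callable taking a scale factor and returning a response map
                        (hypothetical stand-in for patch extraction plus correlation)
    base_size         : (width, height) of the target estimated in the previous frame
    """
    scales = step ** (np.arange(n_scales) - n_scales // 2)   # 1.01^-3 ... 1.01^3
    peaks = [response_at_scale(s).max() for s in scales]
    best = scales[int(np.argmax(peaks))]
    return (base_size[0] * best, base_size[1] * best), best
```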
Simulation results: to verify the effectiveness of the AFLCF method, it is compared with other state-of-the-art trackers on five different types of visual target tracking benchmarks: OTB-2015, VisDrone2019, NfS, DTB70 and LaSOT. The experimental results are evaluated mainly by two criteria, distance precision (Accuracy) and overlap success rate (AUC). Fig. 3 compares the performance of the AFLCF model with 18 published tracking methods on the OTB-2015 benchmark, and Table 1 lists the comparison with the tracking models whose performance ranks in the top five. Tables 2-5 report the performance of the AFLCF model compared with other advanced tracking methods on the four benchmarks VisDrone2019, NfS, DTB70 and LaSOT. Fig. 4 compares the method with nine advanced tracking methods in six scenes with different tracking challenges; the test results show that the method is clearly superior to existing tracking methods and more robust. On all five benchmarks, the proposed AFLCF tracking model obtains top tracking precision, effectively verifying the effectiveness of the method. In addition, the five benchmarks cover short-term tracking, UAV tracking, high-frame-rate video tracking, high-diversity tracking and large-scale scene tracking, respectively, and the highly competitive tracking performance obtained on them verifies the strong generalization and adaptability of the method to different tracking scenes.
The compared tracking methods referred to in the invention are as follows: SRDCF (correlation filter tracking method based on spatial regularization); KCF (kernelized correlation filter tracking method); fDSST (correlation filter tracking method based on multi-scale search); SiameseFC (tracking method based on a twin fully convolutional neural network); Staple (tracking method based on fusing the response maps of color histogram and histogram of oriented gradients features); BACF (background-aware tracking method); Staple-CA (context-aware version of the fusion of color histogram and HOG response maps); DSiam (dynamic twin network based tracking method); PTAV (tracking method based on parallel tracking and verification); ECO (tracking method based on efficient factorized convolution); ECO-HC (efficient factorized convolution tracking method using only shallow features); STRCF [1] (spatial-temporal regularized correlation filter tracking method); DeepSTRCF [1] (spatial-temporal regularized correlation filter tracking method with deep features); MCCT (fusion tracking method combining response-map self-evaluation and an expert strategy); TRACA (context-aware tracking method combined with deep features); CSRDCF (correlation filter tracking method based on channel and spatial reliability); UPDT-RCG (shallow and deep feature fusion tracking method based on response-map confidence evaluation, applying a Gaussian mask instead of a cosine window); HCFT-star (tracking method based on hierarchical feature fusion); ASRCF (correlation filter tracking method based on adaptive spatial regularization); ARCF (correlation filter tracking method based on temporal distortion suppression); AutoTrack (correlation filter tracking method based on automatic spatio-temporal regularization).
Table 1: Comparison of the tracking performance of 19 methods on the OTB-2015 benchmark (top five selected)
[table rendered as an image in the original publication]
Table 2: Performance comparison of various tracking methods on the VisDrone2019 benchmark
[table rendered as an image in the original publication]
Table 3: Performance comparison of various tracking methods on the NfS benchmark
[table rendered as an image in the original publication]
Table 4: Performance comparison of various tracking methods on the DTB70 benchmark
[table rendered as an image in the original publication]
Table 5: Performance comparison of the top five among all correlation-filter-based tracking methods on the LaSOT benchmark
[table rendered as an image in the original publication]
Comparing the AFLCF tracking method with the nine advanced tracking methods in tracking scenes with different challenges, where the challenges mainly include illumination variation, target deformation, occlusion, camera motion, fast target motion and background clutter, the experimental results show that the method is clearly superior to existing tracking methods and is more robust to the various tracking challenges.
Based on the same inventive concept, an embodiment of the invention further provides a robust visual tracking device based on adaptive correlation-filtering feature fusion learning, the device comprising: a processor and a memory, the memory storing program instructions, and the processor calling the program instructions stored in the memory to cause the device to perform the following method steps:
the two feature extractors extract the input tracking image into shallow and deep feature representations, the correlation filtering module completes the learning and updating of the correlation filter coefficients with the aid of the adaptive temporal smoothing penalty, and the target feature maps are then correlated with the updated correlation filters to obtain response maps of the shallow and deep features;
the feature fusion weights output by the adaptive feature response fusion learning module at the previous time step are weighted and fused with the response maps of the shallow and deep features to form the final target feature response map, and the feature fusion weights output at the current time step are used to fuse the feature response maps of the next frame;
and based on the final target feature response map, the scale of the current target is obtained by multi-scale search, realizing visual tracking.
In one embodiment, the adaptive temporal smoothing correlation filtering module uses the overall quality difference and the confidence difference of the target feature response maps between adjacent video frames as the basis for obtaining the degree of variation of the current target and scene;
the obtained degree of variation of the target and scene is then used to design a temporal smoothing penalty whose strength is adjusted adaptively, which is used for reliable updating of the adaptive correlation filtering model.
In another embodiment, the temporal smoothing penalty with adaptively adjusted strength is specifically:
[formula rendered as an image in the original publication]
where the first term of the function e(.), |R_{max,t-1} - R_max|, represents the maximum-confidence gap between adjacent frames; ||R_{t-1} - R||_2 represents the difference in overall response quality between adjacent frames; γ is a hyperparameter; R = β_t ⊙ R^D + (I - β_t) ⊙ R^S denotes the feature response integrated by the adaptive fusion weight β_t, which is learned by the adaptive feature response fusion learning module; R_{t-1} denotes the fused response of the previous frame; and R_{max,t-1} and R_max denote the maximum confidence of the previous and current frames, respectively.
In specific implementations, the adaptive feature response fusion learning module introduces Histogram of Oriented Gradients and convolutional neural network features to integrate the feature responses; the desired correlation filters are solved by two independent correlation filter training processes.
In one embodiment, the adaptive feature response fusion learning module learns the optimal weights for fusing the histogram of oriented gradients and convolutional neural network feature response maps by a ridge regression method supervised with the Gaussian response ground truth.
The optimal weight is expressed as:
[formula rendered as an image in the original publication]
where β_t denotes the optimal feature response fusion matrix; β̃_t is a reference value for β_t that provides prior information and avoids model degradation; and μ is a hyperparameter.
It should be noted that the device description in the above embodiment corresponds to the method description in the embodiments and will not be repeated here.
The processor 1 and the memory 2 may be implemented by any device with computing capability, such as a computer, a single-chip microcomputer or a microcontroller; in specific implementations, the embodiment of the invention does not limit them, and they are selected according to the needs of practical applications.
The memory 2 and the processor 1 transmit data signals through the bus 3, which is not described in detail in the embodiment of the present invention.
Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the method steps in the foregoing embodiments.
The computer readable storage medium includes, but is not limited to, flash memory, hard disk, solid state disk, and the like.
It should be noted that the descriptions of the readable storage medium in the above embodiments correspond to the descriptions of the method in the embodiments, and the descriptions of the embodiments of the present invention are not repeated here.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the invention are brought about in whole or in part when the computer program instructions are loaded and executed on a computer.
The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium or a semiconductor medium, etc.
References:
[1] Li, F., Tian, C., Zuo, W., Zhang, L., Yang, M.: Learning Spatial-Temporal Regularized Correlation Filters for Visual Tracking. In: CVPR, 2018, pp. 4904-4913.
[2] Gonzalez, R. C., Woods, R. E.: Digital Image Processing, 2nd Edition. Prentice Hall, New York, 2002.
[3] Bolme, D. S., Beveridge, J. R., Draper, B. A., Lui, Y. M.: Visual object tracking using adaptive correlation filters. In: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2544-2550.
in the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (8)

1. A visual tracking method based on adaptive correlation filtering feature fusion learning is characterized by comprising the following steps:
extracting the input tracking image into shallow and deep feature representations with two feature extractors, completing the learning and updating of the correlation filter coefficients by a correlation filtering module with the aid of an adaptive temporal smoothing penalty, and correlating the target feature maps with the updated correlation filters to obtain response maps of the shallow and deep features;
weighting and fusing the feature fusion weights output by the adaptive feature response fusion learning module at the previous time step with the response maps of the shallow and deep features to form a final target feature response map, wherein the feature fusion weights output at the current time step are used to fuse the feature response maps of the next frame;
and based on the final target feature response map, obtaining the scale of the current target by multi-scale search, thereby realizing visual tracking.
2. The visual tracking method based on the adaptive correlation filtering feature fusion learning according to claim 1,
the adaptive temporal smoothing correlation filtering module uses the overall quality difference and the confidence difference of the target feature response maps between adjacent video frames as the basis for obtaining the degree of variation of the current target and scene;
and the obtained degree of variation of the target and scene is used to design a temporal smoothing penalty whose strength is adjusted adaptively, which is used for reliable updating of the adaptive correlation filtering model.
3. The visual tracking method based on adaptive correlation filtering feature fusion learning according to claim 1, wherein the temporal smoothing penalty with adaptively adjusted strength is specifically:
[formula rendered as an image in the original publication]
where the first term of the function e(.), |R_{max,t-1} - R_max|, represents the maximum-confidence gap between adjacent frames; ||R_{t-1} - R||_2 represents the difference in overall response quality between adjacent frames; γ is a hyperparameter; R = β_t ⊙ R^D + (I - β_t) ⊙ R^S denotes the feature response integrated by the adaptive fusion weight β_t, which is learned by the adaptive feature response fusion learning module; R_{t-1} denotes the fused response of the previous frame; and R_{max,t-1} and R_max denote the maximum confidence of the previous and current frames, respectively.
4. The visual tracking method based on adaptive correlation filtering feature fusion learning according to claim 1, wherein the adaptive feature response fusion learning module introduces histogram of oriented gradients and convolutional neural network features to integrate the feature responses; the desired correlation filters are solved by two independent correlation filter training processes.
5. The visual tracking method based on adaptive correlation filtering feature fusion learning according to claim 1, wherein the adaptive feature response fusion learning module learns the optimal weights for fusing the histogram of oriented gradients and convolutional neural network feature response maps by a ridge regression method supervised with the Gaussian response ground truth.
6. The visual tracking method based on adaptive correlation filtering feature fusion learning according to claim 5,
the optimal weight is expressed as:
[formula rendered as an image in the original publication]
where β_t denotes the optimal feature response fusion matrix; β̃_t is a reference value for β_t that provides prior information and avoids model degradation; and μ is a hyperparameter.
7. An apparatus for visual tracking based on adaptive correlation filter feature fusion learning, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor calling upon the program instructions stored in the memory to cause the apparatus to perform the method steps of any of claims 1-6.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any of claims 1-6.
CN202110615306.8A 2021-06-02 2021-06-02 Visual tracking method and device based on adaptive correlation filtering feature fusion learning Active CN113538509B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110615306.8A CN113538509B (en) 2021-06-02 2021-06-02 Visual tracking method and device based on adaptive correlation filtering feature fusion learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110615306.8A CN113538509B (en) 2021-06-02 2021-06-02 Visual tracking method and device based on adaptive correlation filtering feature fusion learning

Publications (2)

Publication Number Publication Date
CN113538509A true CN113538509A (en) 2021-10-22
CN113538509B CN113538509B (en) 2022-09-27

Family

ID=78095088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110615306.8A Active CN113538509B (en) 2021-06-02 2021-06-02 Visual tracking method and device based on adaptive correlation filtering feature fusion learning

Country Status (1)

Country Link
CN (1) CN113538509B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419102A (en) * 2022-01-25 2022-04-29 江南大学 Multi-target tracking detection method based on frame difference time sequence motion information

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN110348492A (en) * 2019-06-24 2019-10-18 昆明理工大学 A kind of correlation filtering method for tracking target based on contextual information and multiple features fusion
CN110738684A (en) * 2019-09-12 2020-01-31 昆明理工大学 target tracking method based on correlation filtering fusion convolution residual learning
CN111080675A (en) * 2019-12-20 2020-04-28 电子科技大学 Target tracking method based on space-time constraint correlation filtering
CN111401178A (en) * 2020-03-09 2020-07-10 蔡晓刚 Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering
CN111612817A (en) * 2020-05-07 2020-09-01 桂林电子科技大学 Target tracking method based on depth feature adaptive fusion and context information
CN111723632A (en) * 2019-11-08 2020-09-29 珠海达伽马科技有限公司 Ship tracking method and system based on twin network
US10861170B1 (en) * 2018-11-30 2020-12-08 Snap Inc. Efficient human pose tracking in videos
CN112163020A (en) * 2020-09-30 2021-01-01 上海交通大学 Multi-dimensional time series anomaly detection method and system
CN112329784A (en) * 2020-11-23 2021-02-05 桂林电子科技大学 Correlation filtering tracking method based on space-time perception and multimodal response

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10861170B1 (en) * 2018-11-30 2020-12-08 Snap Inc. Efficient human pose tracking in videos
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN110348492A (en) * 2019-06-24 2019-10-18 昆明理工大学 A kind of correlation filtering method for tracking target based on contextual information and multiple features fusion
CN110738684A (en) * 2019-09-12 2020-01-31 昆明理工大学 target tracking method based on correlation filtering fusion convolution residual learning
CN111723632A (en) * 2019-11-08 2020-09-29 珠海达伽马科技有限公司 Ship tracking method and system based on twin network
CN111080675A (en) * 2019-12-20 2020-04-28 电子科技大学 Target tracking method based on space-time constraint correlation filtering
CN111401178A (en) * 2020-03-09 2020-07-10 蔡晓刚 Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering
CN111612817A (en) * 2020-05-07 2020-09-01 桂林电子科技大学 Target tracking method based on depth feature adaptive fusion and context information
CN112163020A (en) * 2020-09-30 2021-01-01 上海交通大学 Multi-dimensional time series anomaly detection method and system
CN112329784A (en) * 2020-11-23 2021-02-05 桂林电子科技大学 Correlation filtering tracking method based on space-time perception and multimodal response

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DAQUN LI ET AL.: "Siamese Visual Tracking With Deep Features and Robust Feature Fusion", IEEE Access *
DEEPAYAN CHAKRABARTI ET AL.: "Evolutionary Clustering", ACM SIGKDD International Conference on Knowledge Discovery, 2006-08-20/23, Philadelphia, PA (US) *
YUAN Yue (袁越): "Multi-scale target tracking combining correlation filtering and deep networks", China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology series *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419102A (en) * 2022-01-25 2022-04-29 江南大学 Multi-target tracking detection method based on frame difference time sequence motion information
CN114419102B (en) * 2022-01-25 2023-06-06 江南大学 Multi-target tracking detection method based on frame difference time sequence motion information

Also Published As

Publication number Publication date
CN113538509B (en) 2022-09-27

Similar Documents

Publication Publication Date Title
CN111354017B (en) Target tracking method based on twin neural network and parallel attention module
CN111291679A (en) Target specific response attention target tracking method based on twin network
CN109859241B (en) Adaptive feature selection and time consistency robust correlation filtering visual tracking method
CN111260688A (en) Twin double-path target tracking method
CN111582349B (en) Improved target tracking algorithm based on YOLOv3 and kernel correlation filtering
CN110472577B (en) Long-term video tracking method based on adaptive correlation filtering
CN109087337B (en) Long-time target tracking method and system based on hierarchical convolution characteristics
CN112446900B (en) Twin neural network target tracking method and system
CN111583294B (en) Target tracking method combining scale self-adaption and model updating
CN109727272B (en) Target tracking method based on double-branch space-time regularization correlation filter
Zhang et al. A background-aware correlation filter with adaptive saliency-aware regularization for visual tracking
CN110992401A (en) Target tracking method and device, computer equipment and storage medium
CN111968153A (en) Long-time target tracking method and system based on correlation filtering and particle filtering
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
He et al. Towards robust visual tracking for unmanned aerial vehicle with tri-attentional correlation filters
Fu et al. DR 2 track: towards real-time visual tracking for UAV via distractor repressed dynamic regression
Fu et al. Robust multi-kernelized correlators for UAV tracking with adaptive context analysis and dynamic weighted filters
CN113538509B (en) Visual tracking method and device based on adaptive correlation filtering feature fusion learning
CN109146928B (en) Target tracking method for updating gradient threshold judgment model
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
CN112884799A (en) Target tracking method in complex scene based on twin neural network
CN112614161A (en) Three-dimensional object tracking method based on edge confidence
CN112767440A (en) Target tracking method based on SIAM-FC network
CN111539985A (en) Self-adaptive moving target tracking method fusing multiple features
CN115984325A (en) Target tracking method for target volume searching space-time regularization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant